Making Blameless Post-mortems Work Well

Introduction

The practice of holding a blameless post-mortem (or incident review) after there has been a production outage has become commonplace in IT in recent years, but is this because we are really learning valuable things about our processes and our systems, or has it just become a habit? How and why can we get real value from blameless post-mortems?

What is a blameless post-mortem?

A post-mortem (or incident review) is simply the process of trying to understand what happened during an incident in production, and the circumstances that led up to it. It is usually just a meeting where everyone comes together to talk about what happened during an incident, why it happened, how the team responded, what can be done to prevent similar things from happening again in the future, and how to improve future responses when they do.

The idea of a blameless post-mortem is to do all of that without apportioning any blame.

Google gives a very good definition of what a blameless post-mortem is in their SRE handbook.

“A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the ‘wrong’ thing prevails, people will not bring issues to light for fear of punishment.”

This definition succinctly captures the two most essential elements of an effective post-mortem. First, it is vitally important that we always assume that everyone made the best decision they could based on the information they had available to them at the time. Second, we must ensure that post-mortems are conducted in an environment of psychological safety so that people feel safe to speak up, and be open and honest about what happened. 

Why we hold post-mortems

Now we have a definition of what a blameless post-mortem is, but why should we hold post-mortems in the first place?

Post-mortems are about learning

The main reason for holding a post-mortem after an incident is to learn. Dave Rensin described incidents as “unplanned investments where all the costs are paid upfront”. That upfront payment is made in time, in productivity, in impact to customers, in goodwill, and in revenue. The return that we get on that investment is the learning that we extract from it, and post-mortems are the process that we use to extract that value - value we’ve already paid for.

Incidents are messy

Incidents may be great opportunities for learning, but they’re also messy. We like to try to make them tidy and linear, and to pinpoint not only the one root cause but also the precise moment they started and ended, with a nice neat timeline on which all the events fit in an orderly fashion. But the reality is usually very different. It can be hard to say when a problem with a system first started, and at what point a problem becomes ‘an incident’. What about all those near misses? Near misses can be just as important to reflect on as actual incidents: think of those times when we begin to see a problem with a system, only for it to be resolved by an observant engineer or for it to (apparently) magically resolve itself. There is a great learning opportunity here without any of the cost associated with outages in production.

Asking good questions

Making sense of an incident and generating valuable learning from it isn’t always easy. We have to put some effort into understanding what happened, and to do that we have to have open and honest conversations and ask lots of questions.

Using a technique such as the ‘Five Whys’ is a good start, but we need to broaden our perspective and also consider how decisions were made - John Allspaw actually suggests that a better approach is the ‘Infinite Hows’. He also says that we should ask questions that attempt to capture and make sense of that hidden expert knowledge that often comes into play when we respond to incidents. For example, what tricks or shortcuts do people use when responding to incidents, and are there logs or dashboards that people dismiss or are suspicious of?

Why the blameless bit matters

There are a number of reasons why the blameless aspect of post-mortems matters: psychological safety, avoiding killing curiosity, and a whole-systems approach.

Feeling psychologically safe to speak up

As Dr Amy Edmondson notes in her book ‘The Fearless Organization’ (see our review of ‘The Fearless Organization’ here), we need to foster an environment and culture of psychological safety in our workplaces so that people feel able to speak up and say what’s on their minds, and also contribute their ideas, without fear of judgement, ridicule, or reprimand. 

If people are worried about being blamed and shamed during a post-mortem, they will be less likely to bring issues to light, which is the opposite of what we want. We want people to feel safe to speak about problems they’ve seen or worries they have so that these can be investigated and addressed before they become the next customer-impacting incident.

Blame kills curiosity

Not making people feel bad is obviously very important, but another reason that we need to avoid blame is that it kills curiosity. If you think that an incident was someone’s fault, even if you don’t do or say anything to blame them, then you will stop being curious about what happened and you will stop asking questions. We need to be genuinely curious and keep asking lots of questions, otherwise we won’t maximise the learning and value that it’s possible to get from incidents.

It’s all too easy to blame ‘human error’ for failures and incidents. Sadly, we still see this happen far too often - you don’t usually have to look very far or very hard to find cases where an incident is blamed on the actions of an individual or team. Two such examples from earlier this year are when HBO Max publicly blamed ‘the intern’ for sending out an empty test email to thousands of their customers, and when a Salesforce employee was blamed for taking out their entire website for over 5 hours by making a configuration change.

Take a systems view

Blaming individuals in this way is not only wrong, but it’s also short-sighted. Whatever action someone takes, they can only do so if the system enables that action to be taken. We have to move past blaming people and assume that they were doing what they thought was for the best and think about how we can improve the underlying system to prevent similar things from happening again in the future.

Also, we shouldn’t forget about the people side of things - our systems are sociotechnical systems, after all. Incidents and how we handle them have the potential to reveal a lot about our policies and processes, and even about how our organisation is structured and how it functions.

Nora Jones has spoken about how we can use incidents to gain these deeper insights by looking at such things as who is on-call for incidents (junior or senior engineers), and what kind of training and support people receive. She also suggests that we can learn a lot by taking a step back and looking at patterns of incidents within our organisation.
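As a rough illustration of what looking for patterns across incidents might involve, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `IncidentRecord` fields, the theme labels, and the sample data are hypothetical and are not taken from Nora Jones’s work or from any particular tool. Note that the records describe circumstances, not individuals.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical summary of one post-mortem write-up (no names of people)."""
    title: str
    near_miss: bool
    on_call_seniority: str  # assumed label, e.g. "junior" or "senior"
    themes: List[str] = field(default_factory=list)  # assumed free-text tags


def theme_counts(records: List[IncidentRecord]) -> Counter:
    """Count how many incidents each theme appears in."""
    counts: Counter = Counter()
    for record in records:
        # set() so a theme tagged twice on one incident is counted once
        counts.update(set(record.themes))
    return counts


# Made-up sample data, purely for illustration.
records = [
    IncidentRecord("Checkout latency", False, "junior",
                   ["stale runbook", "alert noise"]),
    IncidentRecord("Failed deploy rollback", False, "senior",
                   ["stale runbook"]),
    IncidentRecord("Disk nearly full", True, "junior",
                   ["alert noise", "stale runbook", "unclear ownership"]),
]

counts = theme_counts(records)
near_misses = sum(1 for r in records if r.near_miss)
print(counts.most_common(1))  # the theme that recurs most often
print(f"{near_misses} of {len(records)} records were near misses")
```

A count like this is only a starting point for a conversation: a recurring theme tells you where to ask more questions, not what the answer is.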

How to run a good post-mortem

This is my guide for how to run an effective post-mortem. I’ve drawn inspiration from many sources including numerous talks by John Allspaw and by Dr Richard Cook, the ‘Debriefing Facilitation Guide’ from Etsy, Google’s SRE handbook, a couple of articles from Atlassian, parts of The Unicorn Project by Gene Kim, and my own experience in facilitating post-mortems over a number of years.



Find someone to facilitate

Ask someone not involved in the incident itself, an independent third-party, to facilitate the discussion. You need someone who can take a step back and ask good questions to tease out details that those who are closer to what happened sometimes don’t think are important. 

Find someone to act as scribe

You want your facilitator to concentrate on facilitating. It’s extremely hard to really listen to a conversation, pay attention to what people are saying (and sometimes what they’re not saying), and frame the right next question to ask if you’re also trying to take notes. So find someone else who can do that and act as scribe for the discussion.

Don’t wait too long to hold a post-mortem

Try not to leave too long a gap between the incident itself and when you hold the post-mortem. Appoint a facilitator and give them a few days to schedule the meeting and do some preparatory work, but if you leave it too long, people will start to mentally move on and start to forget the details of what happened.

But don’t rush into remedial action

Too often those attending a post-mortem seem to think that the main objective of the meeting is to come up with a list of follow-up actions to be taken to fix the underlying issue. Try to avoid this. Only come up with actions if it actually makes sense to do so – there’s no requirement or rule written down somewhere that says that after every post-mortem some part of the system in question needs to be ‘fixed’. The value is in the conversation, the learning and the understanding of the system that comes from that conversation. Focus on the learning.

Separate learning from actions

It’s a good idea to split the discussion and the action brainstorming into separate meetings and allow a bit of ‘soak time’ between the two, so that people have time to really take on board the learning from the discussion. You may also want different people in the two meetings. For example, you might want to bring in some subject matter experts when it comes to discussing any changes that need to be made.

It’s all just work

Any work and actions that do come out of post-mortems need to be added to teams’ normal backlogs of work – not recorded on a wiki page or in a Word document somewhere, and not added to a special and separate Jira project. The work needs to be in with a team’s normal work so that it’s visible to them and can be prioritised alongside everything else they have to do.

Post-mortem meetings should be open to all

Post-mortems should be open to anyone who’s interested and wants to attend. That of course means you need to publish the details of when and where post-mortems are happening, along with some information about what the incident was.

Be in no doubt, if people are choosing to come along to your post-mortems, that’s a good thing. Take that as a sign that you’re doing them well, that people are interested and that they’re learning things.

Write up and publish your findings

After the meeting, write up the discussion and findings and publish the write-up for anyone to read. This should be in a central location, not in individual team spaces - everyone should know where to go to find good write-ups that are relevant to their current problem or investigation. It’s all about turning team learning into organisational learning.
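If it helps to have a consistent shape for write-ups, one option is a small shared template. The sketch below is only one possible layout, written in Python; the class name, the section headings, and the sample content are all assumptions for illustration rather than a recommended standard. It deliberately has no field for ‘who did it’.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortemWriteUp:
    """Hypothetical template for a published post-mortem write-up."""
    title: str
    summary: str
    timeline: List[str] = field(default_factory=list)   # what happened, in order
    learnings: List[str] = field(default_factory=list)  # what we now understand
    near_miss: bool = False

    def render(self) -> str:
        """Render the write-up as plain text for a central, searchable location."""
        kind = "Near miss" if self.near_miss else "Incident"
        lines = [f"{kind}: {self.title}", "", "Summary", self.summary, "", "Timeline"]
        lines += [f"- {event}" for event in self.timeline]
        lines += ["", "What we learned"]
        lines += [f"- {item}" for item in self.learnings]
        return "\n".join(lines)


# Made-up example content.
write_up = PostMortemWriteUp(
    title="Disk nearly full on reporting database",
    summary="Disk usage climbed quickly; it was noticed before any customer impact.",
    timeline=["Usage alert fired", "Old exports were cleared manually"],
    learnings=["The usage dashboard is widely distrusted because of past false alarms"],
    near_miss=True,
)
text = write_up.render()
print(text)
```

Keeping the rendered output as plain text makes it easy to publish wherever your organisation already looks for documentation.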

Near misses

Hold post-mortems for near misses as well as ‘full-blown’ incidents – there’s still lots you can learn, and you might prevent the same situation recurring and becoming a full outage in the future. The line between an almost-incident and what we decide is the real thing can be fuzzy, and those near misses are excellent opportunities for learning.

Get creative

Get creative with the questions you ask. Don’t just go through the motions of ‘what went well’, ‘what didn’t go so well’, ‘what can we do better next time’. Really try to understand how decisions were made, what information those decisions were based on, and who made the decisions. Take a step back, look at the wider system, look for patterns.

Learn and improve

And finally, don’t forget to retro your post-mortem process from time to time. Take the time to reflect, evaluate, and look for ways to improve.

Summary

Post-mortems are opportunities to learn and get value from incidents in production - value that we have already paid for upfront; they are opportunities to improve our understanding of our systems, both social and technological.

For our post-mortems to be effective and work well, they need to be blameless - so that we stay curious and ask good questions, and so that people feel safe to speak up. Being blameless really does matter!


This article is based on a talk given at DevOps Manchester in September 2021. The recording is available on YouTube, and the slides are available on Speakerdeck.

 
 
Sophie Weston

Sophie is a Principal at Conflux and has worked in tech for nearly 30 years as a software engineer, DevOps advocate, and now as a consultant. She is interested in systems thinking and organisational design, and her mission is to make tech a better place to be and to work. She's an Ambassador for Women in Tech York, and a co-organiser of DevOpsDays London and Fast Flow Conf.
