My ops postmortem template tried to elicit breadth. Once you’ve got a forest of causes, you can apply Five Whys to add depth.
There’s some overlap among the following questions. The intent is to elicit observations and ideas, not to uniquely categorize them.
* What are all the factors that could have prevented the incident?
* What are all the factors that could have detected the issue before production?
* What are all the factors that could have detected the issue sooner when it did occur?
* What are all the factors that could have accelerated mitigation? (Including, especially, changes that could have reduced the risks of mitigations considered too risky to apply.)
* What are all the factors that could have accelerated remediation?
* What could have reduced the scope or impact?
It’s common to come out of this with a laundry list that overfits the last incident and, if applied, would increase the complexity of the system and add risk. We’d typically apply one or two fixes, and stockpile the rest to see if any of them would have addressed any future incident. Usually most of the “solutions” turn out to be specific to the single incident that prompted them.
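A rough sketch of that stockpiling habit, in case it helps: keep the unapplied fixes in a backlog, tag each one whenever a later incident review judges it would have helped, and only revisit the ones that keep coming up. Every name here (`ProposedFix`, `record_incident`, `recurring_candidates`) and the incident IDs are hypothetical, made up for illustration; this is not part of the original template.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ProposedFix:
    description: str
    applied: bool = False
    # IDs of incidents (the original one plus any later ones) this fix would have addressed.
    relevant_incidents: Set[str] = field(default_factory=set)


def record_incident(backlog: List[ProposedFix], incident_id: str,
                    would_have_helped: List[int]) -> None:
    """Tag the stockpiled fixes (by backlog index) that reviewers judge would have helped."""
    for index in would_have_helped:
        backlog[index].relevant_incidents.add(incident_id)


def recurring_candidates(backlog: List[ProposedFix],
                         min_incidents: int = 2) -> List[ProposedFix]:
    """Return unapplied fixes that were relevant to at least `min_incidents` incidents."""
    return [
        fix for fix in backlog
        if not fix.applied and len(fix.relevant_incidents) >= min_incidents
    ]


# Hypothetical backlog left over from one postmortem ("INC-1").
backlog = [
    ProposedFix("Alert on replica lag", relevant_incidents={"INC-1"}),
    ProposedFix("Add retry to batch job", relevant_incidents={"INC-1"}),
]

# A later incident review decides only the first fix would have helped.
record_incident(backlog, "INC-2", would_have_helped=[0])
print([fix.description for fix in recurring_candidates(backlog)])
```

Fixes that never get a second tag are probably the ones that overfit the incident that prompted them.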
Five Whys is about finding ways to prevent the conditions that let incidents happen, not just preventing incidents, though.
That's kind of the point of asking five "why" questions.
"We fell over because we lost a database server and didn't have enough spare capacity to run peak load on the standby. We don't have full N+1 because it hasn't been funded. It hasn't been funded because the business didn't have a good model for the risk-adjusted cost. Action item: add the cost of this outage to the next budget forecast, and add a requirement for risk-adjusted cost estimates to all future financial plans."