My ops postmortem template tried to elicit breadth. Once you’ve got a forest of ...

jwatte · on May 20, 2018

5y is about finding ways to prevent the conditions that let incidents happen, not just preventing incidents, tough. That's kind of the point of asking 5 why questions. "We fell over because we lost a database server and didn't have enough spare capacity to run peak load on the standby. We don't have full N+1 because it hasn't been funded. It hasn't been funded because the business didn't have a good model for the risk adjusted cost. Action item: add the cost of this outage to the next budget forecast, and add a requirement for risk adjusted cost estimates to all future financial plans."