I understand it. I've worked at AWS, and now at OCI, dealing with systems that affect hundreds to thousands of customers whose businesses are at stake.
Mitigation is your top priority: bringing the system back to a good state.
If follow-up actions are needed, take the least impactful steps to prevent another wave.
If there was a deployment, roll back.
My concern here is that a deployment was made months ago, and many other changes that could make things worse have been introduced since. That is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus making a hot call and causing another outage, makes a big difference.
I'm just asking questions based on the documentation provided; I do not have more insights.
I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.
> My concern here is that a deployment was made months ago, and many other changes that could make things worse have been introduced since.
Make every bit of software in your stack export its build date as a monitoring metric. Have an alert fire if any bit of software goes over 1 month old. Manually or automatically rebuild and redeploy that software.
That prevents 'bit rot', where you daren't rebuild or roll back something.
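A rough sketch of the staleness check, in case it helps. Everything here is an assumption made for illustration: the component names, the 30-day threshold standing in for "1 month", and the function names are all hypothetical, and a real setup would export the age through whatever metrics system you already run rather than printing it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold standing in for the "1 month" suggestion.
MAX_BUILD_AGE = timedelta(days=30)


def build_age_seconds(build_time: datetime, now: datetime) -> float:
    """Age of a build in seconds; the value you would export as a gauge."""
    return (now - build_time).total_seconds()


def stale_components(build_times: dict[str, datetime], now: datetime) -> list[str]:
    """Return the sorted names of components built more than MAX_BUILD_AGE ago."""
    return sorted(
        name for name, built in build_times.items()
        if now - built > MAX_BUILD_AGE
    )


if __name__ == "__main__":
    # Hypothetical inventory: component name -> build timestamp (UTC).
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    builds = {
        "api": datetime(2024, 5, 20, tzinfo=timezone.utc),
        "worker": datetime(2024, 3, 1, tzinfo=timezone.utc),
    }
    for name in stale_components(builds, now):
        age_days = build_age_seconds(builds[name], now) / 86400
        print(f"ALERT: {name} was built {age_days:.0f} days ago; rebuild and redeploy")
```

The point of alerting on age rather than on version is that it catches software nobody has touched, which is exactly the stuff people become afraid to rebuild.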
In a lot of environments this is a terrible idea. In private environments, exposing build manifest information is a good idea, but not so that you can alert at 1 month. Where I work, software that's 2-3 years old is considered good: mature, tested, thoroughly operationalized, and understood by everyone who needs to interact with it on a daily basis. Often, consistency of the user experience is better than being bug-free.