
Having only ever seen one major outage event in person (at a financial institution that hadn't yet come up with an incident response plan; cue three days of madness), I would love to be a fly on the wall at Google or other well-established engineering orgs when something like this goes down.

I'd love to see the red binders come down off the shelf, people organize into incident response groups, and watch as a root cause is accurately determined and a fix put in place.

I know it's probably more chaos than art, but I think there would be a lot to learn by seeing it executed well.



I used to be an SRE at Atlassian in Sydney on a team that regularly dealt with high-severity incidents, and I was an incident manager for probably 5-10 high severity Jira cloud incidents during my tenure too, so perhaps I can give some insight. I left because the SRE org in general at the time was too reactionary, but their incident response process was quite mature (perhaps by necessity).

The first thing I'll say is that most incident responses are reasonably uneventful and very procedural. You do some initial digging to figure out the scope if it's not immediately obvious, make sure service owners have been paged, create incident communication channels (at least a Slack room, if not a physical war room) and you pull people into it. The majority of the time spent by the incident manager is on internal and external comms to stakeholders, making sure everyone is working on something (and often more importantly that nobody is working on something you don't know about), and generally making sure nobody is blocked.
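
To make that concrete, here's a rough Python sketch (purely illustrative, not any company's actual tooling - the names and fields are my own invention) of the bookkeeping an incident manager ends up doing: who has been paged, which comms channel exists, who is working on what, and who is blocked.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class Workstream:
        owner: str                     # who is driving this line of investigation
        description: str               # what they are actually doing
        blocked_on: str | None = None  # set when the owner reports being blocked

    @dataclass
    class Incident:
        summary: str
        severity: int                   # e.g. 1 = highest
        comms_channel: str              # Slack room / physical war room identifier
        paged_teams: list[str] = field(default_factory=list)
        workstreams: list[Workstream] = field(default_factory=list)
        timeline: list[tuple[datetime, str]] = field(default_factory=list)

        def log(self, event: str) -> None:
            # timestamped timeline, which later feeds the postmortem
            self.timeline.append((datetime.now(timezone.utc), event))

        def page(self, team: str) -> None:
            # record that a service-owning team has been paged
            self.paged_teams.append(team)
            self.log(f"paged {team}")

        def start_workstream(self, owner: str, description: str) -> None:
            # every piece of work should be visible to the incident manager
            self.workstreams.append(Workstream(owner, description))
            self.log(f"{owner} started: {description}")

        def blocked(self) -> list[Workstream]:
            # the incident manager's recurring question: who is blocked, on what?
            return [w for w in self.workstreams if w.blocked_on]

Nothing clever in there - the value is that the state is written down in one place, so "is anyone working on something I don't know about?" has an answer.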

To be honest, even though you're more often dealing with complex systems with a higher rate of change and often surprising failure modes, the general sentiment in a well-run incident war room resembles black-box recordings of pilots during emergencies: cool, calm, and collected. Everyone in these kinds of orgs tends to quickly learn that panic doesn't help, so people tend to be pretty chill in my experience. I work in finance now in an org with no formally defined incident response process, and the difference is pretty stark in the incidents I've been exposed to: generally more chaotic, as you describe.


Yes, this is also how it's done at other large orgs. But one key to a quick response is for every low-level team to have at least one engineer on call at any given time. This makes it so any SRE team can engage with the true "owners" of the offending code ASAP.

Also, during an incident, fingers are never publicly/embarrassingly pointed, nor are people blamed. It's all about identifying the issue as fast as possible, fixing it, and going back to sleep/work/home. For better or worse, incidents become routine, so everyone knows exactly what to do and that as long as the incident is resolved soon, it's not the end of the world, so no histrionics are required.


> fingers are never publicly/embarrassingly pointed nor are people blamed

The other problem is that it is almost never a single person's or team's fault. The reality is that it is everyone's fault, and as soon as people accept that, they can prevent it in the future.

Let's take a contrived case where I introduce a bug that floods the network with packets and takes down the network. Is it my fault? Sure. But what about pre-deployment testing? What about monitoring - were there no alarms set up to detect high network load? What about automatic circuit breakers that should have taken the machine offline, and instead let a single machine take down the whole system?

The point is that blaming the person who introduced a code bug is lazy, and does nothing to prevent the issue in the future. When a failure like what happened at Google occurs, it is an organizational failure, not the failure of a single person or team. That is why blaming people is generally not productive.
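
To make the circuit-breaker part of that contrived case concrete, here is a minimal, hypothetical sketch in Python of the kind of safeguard being described: a per-host breaker that isolates the offending machine when it floods the network. The thresholds and the isolate/alarm hooks are placeholders I made up, not any real system's API.

    PACKETS_PER_SEC_LIMIT = 100_000   # assumed threshold for "flooding the network"
    TRIP_AFTER_SECONDS = 10           # breach must be sustained this long before tripping

    class NetworkCircuitBreaker:
        def __init__(self) -> None:
            self.breach_started: float | None = None
            self.tripped = False

        def observe(self, packets_per_sec: float, now: float) -> None:
            # feed in the host's current outbound packet rate
            if self.tripped:
                return
            if packets_per_sec < PACKETS_PER_SEC_LIMIT:
                self.breach_started = None      # load back to normal, reset
                return
            if self.breach_started is None:
                self.breach_started = now       # start timing the breach
            elif now - self.breach_started >= TRIP_AFTER_SECONDS:
                self.tripped = True
                self.isolate_host()             # take this one machine offline
                self.raise_alarm()              # humans still need to know

        def isolate_host(self) -> None:
            # placeholder: drain this host from the load balancer / network
            print("isolating host from the network")

        def raise_alarm(self) -> None:
            # placeholder: page the owning team with context
            print("paging owners: host isolated due to sustained network flood")

The point of the sketch is the layering: even if my buggy deploy gets past testing, the blast radius is one machine plus a page, not the whole system.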


I've only been tangentially pulled into high severity incidents, but the thing that most impressed me was the quiet.

As mentioned in this thread, it's a lot like listening to air traffic comm chatter.

People say what they know, and only what they know, and clearly identify anything they're unsure about. Informative and clear communication matters more than brilliance.

Most of the traffic is async task identification, dispatch, and then reporting in.

And if anyone is screaming or gets emotional, they should not be in that room.


Someone at our place recently commented that the ops team during an incident strongly feels like NASA mission control in critical moments[1]. I wanted to protest, but that's surprisingly accurate.

> And if anyone is screaming or gets emotional, they should not be in that room.

If someone starts yelling in my incident war room for no reason, they get thrown out. I'm a calm and quiet person, but fooling around during a major incident is one of the few things that makes me mad.

1: https://youtu.be/Y0yOTanzx-s?t=3059


It is not surprising at all. Mission Control was forged in fire (literally, for Apollo 1), and they are one of the most visible "incident teams" we know about.

I highly advise reading Gene Kranz's memoir "Failure Is Not an Option" if you work in that kind of environment.


I heard recently that he never said that.

Apparently the phrase came up when the Apollo 13 scriptwriters were gathering stories at NASA; they liked it and gave it to the Kranz character.

Who then decided, "Hey, if everyone thinks I said it..." and titled his memoir.


Yep exactly.


When the incident is over, does it look like 55:50 in that video? :-)


We recently had a 15-month-long project almost fail due to some stupid shit and a really wonky error nobody has understood so far. <Almost fail> as in "Keep the customer on the phone, we have 3 possible things to try, and don't hang up! No one leaves that call until I'm out of hacks to deploy!" That evening we had the entire ops team in Houston mode for several hours.

And yes, once we had a workaround in place the customer accepted, we reacted like that. Except we also had our critical-incident whiskey go around. Then the CEO walked in to congratulate us on that project. Whoops. But he's a good sport, so good times. :)


I have mixed feelings about the finger pointing/public embarrassment thing. Usually the SREs are mature enough, because they have to be; however, the individual teams might not be the same when it comes to reacting to/handling the incident report/postmortem.

On a slightly different note, "low-level team to have at least one engineer on call at any given time" - this line itself is so true, and at the same time there is so much wrong with it. I'm not sure of the best way to put the modern-day slavery into words, given that I have yet to see any large org giving days off to the low-level team engineer just because they were on call.


Having recently joined an SRE team at Google with a very large oncall component, fwiw I think the policies around oncall are fair and well-thought-out.

There is an understanding of how it impacts your time, your energy, and your life that is impressive. To be honest, I feel bad for being so macho about oncall at the org I ran and just having the leads take it all upon ourselves.


What are the policies exactly? I’ve heard it’s equal time off for every night you are on call?


The SRE book (https://landing.google.com/sre/sre-book/chapters/being-on-ca...) says that engineers are compensated for being on call in the form of cash or time off.

Personally I think this is a fair system, and I would hardly call it slavery.

(disclaimer: am Google SRE)


It was pay or time off where I worked before. It's just being established where I work now, but what's being discussed is 2x regular pay for working outside your work hours due to an incident. Doesn't seem like "slavery" to me.


In the places I have worked (lots of different types of jobs), overtime used to be 2x pay or 2x time off. None of them were IT-related, though.


At one Large Org where I worked, the Pager Bearer was paid 25% time for all the time they were on the pager, and standard overtime rates (including weekend/holiday multipliers) from the time the pager went off until they cleared the problem and walked out the plant door, or logged out if the problem was diagnosed/fixed remotely.

25% time for carrying the pager was to compensate for: 1) Requirement to be able to get to the plant in 30 minutes. Fresh snow? Too bad, no skiing for you this weekend. 2) You must be sober and work-ready when the pager goes off. At a party? Great, but I hope you like cranberry juice.
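
To put rough numbers on that scheme (all figures invented, just to show the arithmetic):

    base_rate = 40.0          # assumed hourly rate
    pager_hours = 128         # assumed evenings-plus-weekend stretch carrying the pager
    callout_hours = 3         # assumed Sunday call-out, pager going off to walking out the door
    weekend_multiplier = 2.0  # assumed weekend overtime multiplier

    standby_pay = 0.25 * base_rate * pager_hours               # 25% time for carrying the pager
    callout_pay = weekend_multiplier * base_rate * callout_hours
    print(standby_pay, callout_pay, standby_pay + callout_pay)  # 1280.0 240.0 1520.0

So carrying the pager is real money even when it never goes off, which is exactly the point.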

As the customer who signed the time cards for the pager duty, I thought that was not only fair, but it also drove home to me as a manager that the cost was real and was coming out of my budget, not some general IT budget that someone else took the hit for. This is one case where "You want coverage for your service? Give me a charge code for the overtime." was not just senseless bureaucratic friction; it led to healthier, business-driven decisions.


> I left because the SRE org in general at the time was too reactionary

It shows in their products (though it's improving)


Google SRE doesn't have magical incident response beans that we hoard from the rest of the world. What makes Google SRE institutionally strong is that we have senior executive support to execute on all the best practices described in the book:

https://landing.google.com/sre/sre-book/toc/index.html

At my last job, I bought a copy of this book, but we only had the organizational bandwidth to do a few of the things mentioned. At Google, we do all of them.

The incident on Sunday basically played out as described in chapters 13 and 14. There is always the fog of war that exists during an incident, so no, it wasn't always people calmly typing into terminals, but having good structure in place keeps the madness manageable.

Disclosure: I work in Google NetInfra SRE, and while my department was/is heavily involved in this incident, I personally was not.

Also, we're [always] hiring:

https://careers.google.com/jobs/results/?company=Google&comp...


It's interesting to see it go down. There's some chaos involved, but from my perspective it's the constructive[0] kind.

If you're interested in how these sorts of incidents are managed, check out the SRE Book[1] - it has a chapter or two on this and many other related topics.

Disclosure: I work in Google Cloud, but not SRE.

[0]: https://principiadiscordia.com/book/70.php

[1]: https://landing.google.com/sre/books/


Our own version of Netflix's "Chaos Monkey" is named "Eris" for precisely the reason mentioned in your first footnote.



You might be interested in https://response.pagerduty.com/, PagerDuty's major incident response process documentation - a good starting point for that red binder.

Having been in the ringmaster's seat for major incidents ranging from "relatively routine" to "it's all on fire", and having had a ringside seat for a cloud provider outage of comparable magnitude to this one - it still fascinates me how creative the solutions dreamed up under high pressure can get, and how effective it is to have someone keeping the response calm and making it _feel like it's in control_.


Anyone know of other public resources like the one from PagerDuty? The SRE book and workbook at https://landing.google.com/sre/books/ have some details, but curious if there are others people would recommend.


Anything out of Mission Control during Apollo. The Army has good stuff too. FEMA has some good stuff on how they apply it on the ground and train.

I particularly like Gene Kranz's "Failure Is Not an Option". It is more background, but it works. In general, it is not crazy hard: you get the roles and you distribute them. Someone can have multiple roles; that depends on the size of the incident.

The usual roles I differentiate are Point (think of it as IC if you want), Comms, and Logs.


Tangentially related, you might find the documentary "Out of the Clear Blue Sky" interesting. It's about bond trading firm Cantor Fitzgerald (headquartered on the top floors of the World Trade Center) in the days after 9/11. Not even the best plan could have helped them open for business in two days after having lost just about everything. Over the last decade or so we've put a lot of emphasis on documentation when it comes to incident response but the movie is really a testament to how leadership and execution are so much more important.



