A very simple example: you do something stupid on a remote machine over SSH (something that causes high network or CPU usage), and then you can't undo it because SSH becomes unresponsive.
Yeah got locked out of a dedi this way (bad ipfw ruleset which killed my ssh connection) and the virtual console wasn’t working. Fun times. At least it was a personal machine.
Some equipment will auto-revert to a last known good configuration if you don't approve new changes within a window... though high CPU could lock that process up...
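For boxes that don't have that built in (some gear does, e.g. Juniper's commit confirmed), here's a minimal sketch of the same confirm-or-revert idea on Linux, assuming iptables-save/iptables-restore are available. The snapshot path and the five-minute window are arbitrary choices, not anything from this thread:

```python
import subprocess
import threading

BACKUP_PATH = "/root/iptables.last-known-good"  # hypothetical snapshot location
WINDOW_SECONDS = 300                            # revert unless confirmed within 5 minutes

confirmed = threading.Event()

def snapshot_current_rules() -> None:
    """Save the last known good ruleset before applying anything new."""
    with open(BACKUP_PATH, "w") as f:
        subprocess.run(["iptables-save"], stdout=f, check=True)

def revert_if_unconfirmed() -> None:
    """Dead man's switch: restore the snapshot if nobody confirmed in time."""
    if not confirmed.is_set():
        with open(BACKUP_PATH) as f:
            subprocess.run(["iptables-restore"], stdin=f, check=True)

def apply_rules(rules_file: str) -> None:
    """Apply a new ruleset and arm the auto-revert timer."""
    snapshot_current_rules()
    with open(rules_file) as f:
        subprocess.run(["iptables-restore"], stdin=f, check=True)
    # Even if the new rules kill our SSH session, this timer fires locally
    # and puts the old ruleset back.
    threading.Timer(WINDOW_SECONDS, revert_if_unconfirmed).start()

def confirm() -> None:
    """Operator calls this once they've verified they can still get in."""
    confirmed.set()
```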
In this case the old configuration was lost. It took an hour to rebuild, because the tooling normally used to rebuild it for testing was unavailable and the build had to be done locally, on a single machine someone SSH-ed into, and that just takes a while. Luckily, a person was around who knew how to do the rebuild without the fancy tooling.
It can help some amount, though. Bind the NIC interrupts to a small handful of cores. Or ensure that ssh only works through a management NIC, and have that NIC bound to the same cores as sshd. You can get really fancy with these setups, especially when working with NUMA stuff.
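A hedged sketch of that setup on Linux: steer the management NIC's interrupts onto a couple of reserved cores and pin sshd to the same cores. The interface name, core choice, and IRQ discovery are assumptions; in practice this is more often done with irqbalance policies, tuned profiles, or CPUAffinity= in the sshd unit file rather than a one-off script:

```python
import re
import subprocess
from pathlib import Path

MGMT_IFACE = "eno2"        # hypothetical management NIC
RESERVED_CORES = "0-1"     # cores reserved for the management path

def irqs_for_interface(iface: str) -> list[int]:
    """Find IRQ numbers whose /proc/interrupts line mentions the interface."""
    irqs = []
    for line in Path("/proc/interrupts").read_text().splitlines():
        if iface in line:
            match = re.match(r"\s*(\d+):", line)
            if match:
                irqs.append(int(match.group(1)))
    return irqs

def pin_irqs(irqs: list[int], cores: str) -> None:
    """Restrict each IRQ to the reserved cores (needs root)."""
    for irq in irqs:
        Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(cores)

def pin_sshd(cores: str) -> None:
    """Pin running sshd processes to the same cores via taskset."""
    pids = subprocess.run(["pgrep", "-x", "sshd"],
                          capture_output=True, text=True, check=True)
    for pid in pids.stdout.split():
        subprocess.run(["taskset", "-cp", cores, pid], check=True)

if __name__ == "__main__":
    pin_irqs(irqs_for_interface(MGMT_IFACE), RESERVED_CORES)
    pin_sshd(RESERVED_CORES)
```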
I'm a bit surprised there's no sort of SSH undo subroutine that reverses the previous command if connectivity is lost. Of course it couldn't cover every possible stupid thing but it could fix simple stupid mistakes like fouling up a port assignment or disabling the wrong network adapter.
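A DIY approximation of that idea, different from the timed-window sketch above in that it keys off observed connectivity: apply a risky change, watch a heartbeat to a management host, and revert if it stops answering. The host, timings, and example commands are placeholders, not anything from this thread:

```python
import subprocess
import time

HEARTBEAT_HOST = "192.0.2.1"   # hypothetical management gateway to ping
CHECK_INTERVAL_S = 10
MAX_FAILED_PINGS = 5

def host_reachable(host: str) -> bool:
    """Single ping with a short timeout; True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def run_with_revert(apply_cmd, revert_cmd, watch_seconds=300):
    """Run a risky command, then undo it if connectivity drops during the watch window."""
    subprocess.run(apply_cmd, check=True)
    deadline = time.monotonic() + watch_seconds
    failures = 0
    while time.monotonic() < deadline:
        time.sleep(CHECK_INTERVAL_S)
        if host_reachable(HEARTBEAT_HOST):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILED_PINGS:
                # Looks like the change cut us off: roll it back.
                subprocess.run(revert_cmd, check=True)
                return
    # Survived the whole window with connectivity intact: keep the change.

# Hypothetical usage: if we down the wrong NIC, the watchdog brings it back.
# run_with_revert(["ip", "link", "set", "eth1", "down"],
#                 ["ip", "link", "set", "eth1", "up"])
```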
It isn't a worst-case though. They should have had the capability to resolve this issue with no network connectivity, which would be the worst case failure of the network control plane.
I don't work as an SRE, but isn't that covered by providing engineers physical access to secure facilities in the absolute worst case?
The article states:
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
An anecdote: my (not-IT) company does exactly this for out-of-band management... except, in one small satellite location, the phone company no longer provided any copper POTS lines; all they could do was an RJ-11 jack out of the ONT that was backhauled as (lossy) VoIP. So the modem couldn't be made to work.
My point being: it seems that modems are becoming less and less viable for out-of-band management.
Fun story: AT&T forced our hand to get off our PRI (voice T1) and move to their fiber service. They also insisted on having a dedicated phone line installed so they can dial into their modem in case of circuit failure. We can't buy a copper phone line from them, so it gets routed over the same fiber circuit and goes through a digital-to-analog device back to the router. I don't think one hand talks to the other over there...
A completely OOB management network is an amazingly high cost when you have presence all over the world. I don't think anybody has gone to the length to double up on dark fiber and OTN gear just for management traffic.
That's less of an issue. The real issue is how you classify traffic on the network. Gmail, for example, is helpful to incident response, but should it be classified as OOB management traffic?
Man that's got to suck.