A very simple example: you do something stupid on a remote machine over SSH (something that causes high network or CPU usage), and then you can't undo it because SSH becomes unresponsive.
Yeah got locked out of a dedi this way (bad ipfw ruleset which killed my ssh connection) and the virtual console wasn’t working. Fun times. At least it was a personal machine.
Some equipment will auto-revert to a last known good configuration if you don't approve new changes within a window... though high CPU could lock that process up...
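For boxes that don't have that built in (some gear does, e.g. Juniper's commit confirmed), here's a minimal sketch of the same confirm-or-revert idea on Linux, assuming iptables-save/iptables-restore are available. The snapshot path and the five-minute window are arbitrary choices, not anything from this thread:

```python
import subprocess
import threading

BACKUP_PATH = "/root/iptables.last-known-good"  # hypothetical snapshot location
WINDOW_SECONDS = 300                            # revert unless confirmed within 5 minutes

confirmed = threading.Event()

def snapshot_current_rules() -> None:
    """Save the last known good ruleset before applying anything new."""
    with open(BACKUP_PATH, "w") as f:
        subprocess.run(["iptables-save"], stdout=f, check=True)

def revert_if_unconfirmed() -> None:
    """Dead man's switch: restore the snapshot if nobody confirmed in time."""
    if not confirmed.is_set():
        with open(BACKUP_PATH) as f:
            subprocess.run(["iptables-restore"], stdin=f, check=True)

def apply_rules(rules_file: str) -> None:
    """Apply a new ruleset and arm the auto-revert timer."""
    snapshot_current_rules()
    with open(rules_file) as f:
        subprocess.run(["iptables-restore"], stdin=f, check=True)
    # Even if the new rules kill our SSH session, this timer fires locally
    # and puts the old ruleset back.
    threading.Timer(WINDOW_SECONDS, revert_if_unconfirmed).start()

def confirm() -> None:
    """Operator calls this once they've verified they can still get in."""
    confirmed.set()
```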
In this case the old configuration was lost. It took an hour to rebuild, because the tooling normally used to rebuild it for testing was unavailable and the build had to be done locally, on a single machine someone SSH-ed into, and that just takes a while. Luckily, a person was around who knew how to do the rebuild without the fancy tooling.
It can help some amount, though. Bind the NIC interrupts to a small handful of cores. Or ensure that ssh only works through a management NIC, and have that NIC bound to the same cores as sshd. You can get really fancy with these setups, especially when working with NUMA stuff.
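A hedged sketch of that setup on Linux: steer the management NIC's interrupts onto a couple of reserved cores and pin sshd to the same cores. The interface name, core choice, and IRQ discovery are assumptions; in practice this is more often done with irqbalance policies, tuned profiles, or CPUAffinity= in the sshd unit file rather than a one-off script:

```python
import re
import subprocess
from pathlib import Path

MGMT_IFACE = "eno2"        # hypothetical management NIC
RESERVED_CORES = "0-1"     # cores reserved for the management path

def irqs_for_interface(iface: str) -> list[int]:
    """Find IRQ numbers whose /proc/interrupts line mentions the interface."""
    irqs = []
    for line in Path("/proc/interrupts").read_text().splitlines():
        if iface in line:
            match = re.match(r"\s*(\d+):", line)
            if match:
                irqs.append(int(match.group(1)))
    return irqs

def pin_irqs(irqs: list[int], cores: str) -> None:
    """Restrict each IRQ to the reserved cores (needs root)."""
    for irq in irqs:
        Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(cores)

def pin_sshd(cores: str) -> None:
    """Pin running sshd processes to the same cores via taskset."""
    pids = subprocess.run(["pgrep", "-x", "sshd"],
                          capture_output=True, text=True, check=True)
    for pid in pids.stdout.split():
        subprocess.run(["taskset", "-cp", cores, pid], check=True)

if __name__ == "__main__":
    pin_irqs(irqs_for_interface(MGMT_IFACE), RESERVED_CORES)
    pin_sshd(RESERVED_CORES)
```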
I'm a bit surprised there's no sort of SSH undo subroutine that reverses the previous command if connectivity is lost. Of course it couldn't cover every possible stupid thing but it could fix simple stupid mistakes like fouling up a port assignment or disabling the wrong network adapter.
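A DIY approximation of that idea, different from the timed-window sketch above in that it keys off observed connectivity: apply a risky change, watch a heartbeat to a management host, and revert if it stops answering. The host, timings, and example commands are placeholders, not anything from this thread:

```python
import subprocess
import time

HEARTBEAT_HOST = "192.0.2.1"   # hypothetical management gateway to ping
CHECK_INTERVAL_S = 10
MAX_FAILED_PINGS = 5

def host_reachable(host: str) -> bool:
    """Single ping with a short timeout; True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def run_with_revert(apply_cmd, revert_cmd, watch_seconds=300):
    """Run a risky command, then undo it if connectivity drops during the watch window."""
    subprocess.run(apply_cmd, check=True)
    deadline = time.monotonic() + watch_seconds
    failures = 0
    while time.monotonic() < deadline:
        time.sleep(CHECK_INTERVAL_S)
        if host_reachable(HEARTBEAT_HOST):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILED_PINGS:
                # Looks like the change cut us off: roll it back.
                subprocess.run(revert_cmd, check=True)
                return
    # Survived the whole window with connectivity intact: keep the change.

# Hypothetical usage: if we down the wrong NIC, the watchdog brings it back.
# run_with_revert(["ip", "link", "set", "eth1", "down"],
#                 ["ip", "link", "set", "eth1", "up"])
```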
It isn't a worst-case though. They should have had the capability to resolve this issue with no network connectivity, which would be the worst case failure of the network control plane.
I don't work as an SRE, but isn't that covered by providing engineers physical access to secure facilities in the absolute worst case?
The article states:
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
An anecdote: my (not-IT) company does exactly this for out-of-band management... except, in one small satellite location, the phone company no longer provided any copper POTS lines; all they could do was an RJ-11 jack out of the ONT that was backhauled as (lossy) VoIP. So the modem couldn't be made to work.
My point being: it seems that modems are becoming less and less viable for out-of-band management.
Fun story: AT&T forced our hand to get off our PRI (voice T1) and move to their fiber service. They also insisted on having a dedicated phone line installed so they can dial into their modem in case of circuit failure. We can't buy a copper phone line from them, so it gets routed over the same fiber circuit and goes through a digital-to-analog device back to the router. I don't think one hand talks to the other over there...
A completely OOB management network is an amazingly high cost when you have presence all over the world. I don't think anybody has gone to the length to double up on dark fiber and OTN gear just for management traffic.
That's less of an issue. The real issue is how you classify traffic on the network. Gmail, for example, is helpful to incident response, but should it be classified as OOB management traffic?
Man that's got to suck.