Massive reconvergence event in their network, causing edge router BGP sessions to bounce (due to CPU). Right now all their big peers are shutting down sessions with them to give Level3's network the ability to reconverge. Prefixes announced to 3356 are frozen on their route reflectors and not getting withdrawn.
Edit: if you are a Level3 customer shut your sessions down to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. In order to bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
Contrary to what that link says, the software was not thoroughly tested. Normal testing was bypassed - per management request after a small code change.
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac 25, the Ariane V, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737-Max...
That's why the most reliable way to instil this lesson is to instil it into our tools. Automate as much testing as possible, so that bypassing the tests becomes more work than running them.
I disagree; it's in part a people problem - more draconian test suites just make developers more inclined to cheat, and they tend to write tests which aren't valid or which just get the tool passing...
It's more important to visually model and test than to enforce some arbitrary set of rules that don't apply universally - then you have at least the visual impetus of 'this is wrong' or 'I need to test this right'.
A lot of time is spent visually testing UIs and yet these same people struggle with testing the code that matters...
Probably not the book you are thinking of, since it’s just about the AT&T incident, but “The Day the Phones Stopped Ringing” by Leonard Lee is a detailed description of the event.
It’s been many years since I read it, but I recall it being a very interesting read.
For some reason in my university almost every CS class would start with an anecdote about the Therac 25, Ariane V, and/or a couple others as a motivation on why we the class existed. It was sort of a meme.
The lessons are definitely still taught; I don't know if they're actually learned, of course. And who knows who actually taught the 737-Max software devs - I don't suppose they're fresh out of uni.
Unfortunately most people become a manager by being a stellar independent contributor. People management and engineering are very different skills; I'm always impressed when I see someone make that jump smoothly.
I always wanted companies to hire people managers as a career path of its own. An engineer can be an excellent technical lead or architect, but it can feel like you've started over once you're responsible for the employees, their growth, and their career paths.
Yeah, it just sucks that you eventually have someone making significant people management decisions without the technical knowledge of what the consequences could end up being. This would be even worse if you had people manager hiring be completely decoupled. The US military works this way and I have to say it's not the best mode.
Typically yes actually, the director of engineering should always be an engineer. Of course, these are hardware companies so it would probably be some kind of hardware engineer.
As a former AT&T contractor, albeit from years later, this checks out. Sat in a "red jeopardy" meeting once because a certain higher-up couldn't access the AT&T branded security system at one of his many houses.
The build that broke it was rushed out and never fully tested, adding a fairly useless feature for said higher-up that improved the UX for users with multiple houses on their account.
This reminds me of an incident on the early internet (perhaps ARPANET at that point) where a routing table got corrupted so it had a negative-length route which routers then propagated to each other, even after the original corrupt router was rebooted. As with AT&T, they had to reboot all the routers at once to get rid of the corruption.
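To make the mechanism concrete, here's a toy distance-vector sketch (Python; the router names, metrics, and full-mesh topology are made up, and this is not the actual ARPANET routing algorithm) showing why a "negative-length" route keeps winning and keeps coming back until every node is reset at once:

    # A bogus "too good to be true" metric always wins the min() comparison,
    # so every neighbor adopts it, and rebooting only the original culprit
    # doesn't help: it immediately relearns the bad route from its peers.
    INF = 999
    NODES = ["r1", "r2", "r3"]          # hypothetical routers, fully meshed
    LINK_COST = 1

    # dist[x] = x's current believed cost to some destination
    dist = {"r1": 2, "r2": 3, "r3": 3}  # sane starting metrics

    def exchange_rounds(k):
        """k rounds of everyone adopting min(own, neighbor's metric + link cost)."""
        for _ in range(k):
            snapshot = dict(dist)
            for x in NODES:
                for y in NODES:
                    if y != x:
                        dist[x] = min(dist[x], snapshot[y] + LINK_COST)

    dist["r1"] = -100                   # corruption: a "negative-length" route
    exchange_rounds(3)
    print("after corruption spreads:", dist)

    dist["r1"] = INF                    # reboot only the original culprit...
    exchange_rounds(3)
    print("after rebooting r1 alone:", dist)   # ...and it relearns the bad route

    for x in NODES:                     # reset everyone at the same time
        dist[x] = INF
    dist["r1"] = 2                      # r1 re-announces its legitimate metric
    exchange_rounds(3)
    print("after simultaneous reset:", dist)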
I can't remember where i read about this, but i recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did i imagine this?
Interesting, thanks! That is different to the story i remember, but it's possible that i remember incorrectly, or read an incorrect explanation.
I believe that i read about this episode in Hans Moravec's book 'Mind Children'. I can see in Google Books that chapter 5 is on 'Wildlife', and there is a section 'Spontaneous Generation', which promises to talk about a "software parasite" which emerged naturally in the ARPAnet - but the bulk of it is not available.
I have spent hours and hours banging my head against Erlang distributed system bugs in production. I am absolutely mystified why anyone thought just using a particular programming language would prevent these scenarios. If it's Turing-complete, expect the unexpected.
The idea isn't that Erlang is infallible in the design of distributed systems.
The idea is it takes away enough foot-guns that if you're banging your head against systems written in it, you'd be banging your head even harder and more often if the same implementor had used another language.
Are you referring to CenturyLink’s 37-hour, nationwide outage?
> In this instance, the malformed packets [Ethernet frames?] included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes.
I think we used to call that a poison pill message (I still bring it up routinely when we talk about load balancing and why infinite retries are a very, very bad idea).
But your queue will grow and grow and the fraction of time you spend servicing old messages grows and grows.
Not a terribly big fan of these queueing systems. People always seem to bung things up in ways they are not quite equipped to fix (in the “you are not smart enough to debug the code you wrote” sense).
Last time I had to help someone with such a situation, we discovered that the duplicate processing problem had existed for >3 months prior to the crisis event, and had been consuming 10% of the system capacity, which was just low enough that nobody noticed.
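The standard mitigation is the one implied above: cap the retries and park the poison pill in a dead-letter queue instead of requeueing it forever. A minimal sketch (plain Python, in-memory queues; the handler, queue names, and retry limit are just placeholders):

    import queue

    MAX_ATTEMPTS = 3              # after this many failures, park it, don't retry

    work_q = queue.Queue()        # normal work
    dead_letter_q = queue.Queue() # poison pills land here for a human to inspect

    def handle(msg):
        # stand-in for real processing; raises on a malformed ("poison") message
        if msg.get("poison"):
            raise ValueError("cannot process message")
        print("processed", msg["id"])

    def worker():
        while not work_q.empty():
            msg = work_q.get()
            try:
                handle(msg)
            except Exception:
                msg["attempts"] = msg.get("attempts", 0) + 1
                if msg["attempts"] >= MAX_ATTEMPTS:
                    dead_letter_q.put(msg)   # park it; do NOT put it back on work_q
                else:
                    work_q.put(msg)          # bounded retry
            finally:
                work_q.task_done()

    work_q.put({"id": 1})
    work_q.put({"id": 2, "poison": True})
    work_q.put({"id": 3})
    worker()
    print("dead-lettered:", dead_letter_q.qsize())

With unbounded retries instead of the MAX_ATTEMPTS check, message 2 would circulate forever and eat a growing share of the worker's time, which is exactly the failure mode described above.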
The thing with feature group D trunks to the long distance network is you could (and still can on non-IP/mobile networks) manually route to another long distance carrier like Verizon, and sidestep the outage from the subscriber end, full stop. That's certainly not possible with any of the contemporary internet outages.
You can inject changes in routing, but if the other carrier doesn't route around the affected network, you're back to square one. That's part of why Level3/CenturyLink was depeered and why several prefixes that are normally announced through it were quickly rerouted by their owners.
That's my point; as a subscriber, you can prefix a long distance call with a routing code to avoid, for example, a shut down long distance network without any administrator changes. Routing to the long distance networks is done independently through the local network, so if AT&T's long distance network was having issues, it'd have no impact on your ability to access Verizon's long distance network.
There's actually no technical reason why you couldn't do that with IP (4 or 6), although you'd need an appropriately located host to be running a relay daemon[0].
0: i.e. something that takes, say, a UDP packet on port NNNN containing a whole raw IPv4 packet, throws away the wrapping, and drops the IPv4 packet onto its own network interface. This is safe - the packet must shrink by a dozen or two bytes with each retransmission, so it can't loop or amplify indefinitely - but such a relay is usually not actually set up anywhere.
Edit: It probably wouldn't work for TCP though - maybe try TOR?
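To make [0] concrete, here's a rough sketch of such a relay (Python raw sockets, needs root; the port is arbitrary and it deliberately skips the "must shrink" check and any validation, so it's illustration only, not something to expose to the internet):

    import socket

    LISTEN_PORT = 9999  # the "port NNNN" from [0]; arbitrary

    # Receive the wrapped packets over ordinary UDP.
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(("0.0.0.0", LISTEN_PORT))

    # Raw socket for re-injecting the inner IPv4 packet (requires root).
    # IP_HDRINCL means the payload we send already contains a full IP header.
    raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
    raw.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)

    while True:
        inner, _src = udp.recvfrom(65535)     # payload is a complete IPv4 packet
        if len(inner) < 20:                   # too short to hold an IPv4 header
            continue
        dst = socket.inet_ntoa(inner[16:20])  # destination address from the header
        raw.sendto(inner, (dst, 0))           # drop it onto our own network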
There are plenty of ways to do what you're describing, and they all work with TCP. Some of them only work if the encapsulated traffic is IPv6 (and are designed to give IPv6 access on ISPs that only support IPv4). Some of them may end up buffering the TCP stream and potentially generating packet boundaries at different locations than in the original TCP stream.
This sounds like the event that is described in the book Masters of Deception: The gang that ruled cyberspace. The way I remember it the book attributes the incident to MoD, while of course still being the result of a bug/faulty design.
BGP is a path-vector routing protocol: every router on the internet is constantly updating its routing tables based on the paths advertised by its peers, generally preferring the shortest AS path to an advertised prefix. When a new route is announced it takes time to propagate through the network and for all routers in the chain to “converge” into a single coherent view.
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
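If you want to see that from the outside, you can ask a public route collector whether a prefix is still being seen behind AS3356. A rough sketch against RIPEstat's looking-glass endpoint (the JSON field names here are from memory, so double-check them against the RIPEstat docs; the prefix is a placeholder):

    import json
    import urllib.request

    PREFIX = "198.51.100.0/24"   # placeholder; use one of your own prefixes
    STUCK_AS = "3356"            # Level3/CenturyLink

    url = f"https://stat.ripe.net/data/looking-glass/data.json?resource={PREFIX}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)["data"]

    still_via_3356 = 0
    for rrc in data.get("rrcs", []):          # one entry per route collector
        for peer in rrc.get("peers", []):     # one entry per collector peer
            as_path = peer.get("as_path", "").split()
            if STUCK_AS in as_path:
                still_via_3356 += 1

    print(f"{still_via_3356} collector peers still see {PREFIX} via AS{STUCK_AS}")

If your prefix keeps showing up with 3356 in the path long after you've shut your sessions down, that's the "announcements not being pulled" problem described above.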
BGP operates as a rumor mill. Convergence is the process of all of the rumors settling into a steady state. The rumors are of the form "I can reach this range of IP addresses by going through this path of networks." Networks will refuse to listen to rumors that have themselves in the path, as that would cause traffic to loop.
For each IP range described in the rumor table, each network is free to choose whichever rumor they like best among all they have heard, and send traffic for that range along the described path. Typically this is the shortest, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws it.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: Let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
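Here's a toy simulation of that exact A/B/C/D example (Python; synchronous rounds, shortest loop-free path wins, no timers or policy, so it's only a caricature of real BGP). After the A-B link is cut, you can watch C and D briefly adopt each other's stale paths before the route finally disappears everywhere:

    NEIGHBORS = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B", "D"}, "D": {"B", "C"}}
    ORIGIN = "A"   # A originates the prefix

    # rib[x][n] = path (list of nodes, origin last) that neighbor n last
    # advertised to x, or None if it was withdrawn.
    rib = {x: {} for x in NEIGHBORS}

    def best(x):
        """x's chosen path: empty if x originates, else shortest loop-free path."""
        if x == ORIGIN:
            return []
        cands = [p for p in rib[x].values() if p is not None and x not in p]
        return min(cands, key=len) if cands else None

    def one_round():
        """Everyone advertises its current best path (self prepended) at once."""
        chosen = {x: best(x) for x in NEIGHBORS}
        for x, path in chosen.items():
            for n in NEIGHBORS[x]:
                rib[n][x] = [x] + path if path is not None else None
        return chosen

    for _ in range(4):                   # let the topology converge normally
        one_round()

    # Cut the A-B link: drop the adjacency and whatever was learned over it.
    NEIGHBORS["A"].discard("B"); NEIGHBORS["B"].discard("A")
    rib["B"].pop("A", None);     rib["A"].pop("B", None)

    for r in range(1, 5):                # watch C and D hunt, then give up
        chosen = one_round()
        print(f"round {r}:", {x: chosen[x] for x in "BCD"})

Rounds 1-2 show B withdrawing while C and D flip to the stale paths through each other; by round 3 those are rejected as loops too and the route is gone everywhere.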
IP network routing is distributed systems within distributed systems. For whatever reason the distributed system that is the CenturyLink network isn't "converging" - or, we could say, becoming consistent, or settling - in a timely manner.