Massive reconvergence event in their network, causing edge router BGP sessions to bounce (due to CPU). Right now all their big peers are shutting down sessions with them to give Level3's network the ability to reconverge. Prefixes announced to 3356 are frozen on their route reflectors and not getting withdrawn.
Edit: if you are a Level3 customer shut your sessions down to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. In order to bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
Contrary to what that link says, the software was not thoroughly tested. Normal testing was bypassed - per management request after a small code change.
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac 25, the Ariane V, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737-Max...
That's why the most reliable way to instil this lesson is to instil it into our tools. Automate as much testing as possible, so that bypassing the tests becomes more work than running them.
I disagree; it's in part a people problem - more draconian test suites just make developers more inclined to cheat, and they tend to write tests which aren't valid or which just get the tool passing...
It's more important to visually model and test than to enforce some arbitrary set of rules that don't apply universally - then you have at least the visual impetus of 'this is wrong' or 'I need to test this right'.
A lot of time is spent visually testing UIs and yet these same people struggle with testing the code that matters...
Probably not the book you are thinking of, since it’s just about the AT&T incident, but “The Day the Phones Stopped Ringing” by Leonard Lee is a detailed description of the event.
It’s been many years since I read it, but I recall it being a very interesting read.
For some reason in my university almost every CS class would start with an anecdote about the Therac 25, Ariane V, and/or a couple others as a motivation on why we the class existed. It was sort of a meme.
The lessons are definitely still taught; I don't know if they're actually learned, of course. And who knows who actually taught the 737-Max software devs - I don't suppose they're fresh out of uni.
Unfortunately most people become a manager by being a stellar independent contributor. People management and engineering are very different skills; I'm always impressed when I see someone make that jump smoothly.
I always wanted companies to hire people managers as a career path of its own. An engineer can be an excellent technical lead or architect, but it can feel like you've started over once you're responsible for the employees, their growth, and their career paths.
Yeah, it just sucks that you eventually have someone making significant people management decisions without the technical knowledge of what the consequences could end up being. This would be even worse if you had people manager hiring be completely decoupled. The US military works this way and I have to say it's not the best mode.
Typically yes actually, the director of engineering should always be an engineer. Of course, these are hardware companies so it would probably be some kind of hardware engineer.
As a former AT&T contractor, albeit from years later, this checks out. Sat in a "red jeopardy" meeting once because a certain higher-up couldn't access the AT&T branded security system at one of his many houses.
The build that broke it was rushed out and never fully tested, adding a fairly useless feature for said higher-up that improved the UX for users with multiple houses on their account.
This reminds me of an incident on the early internet (perhaps ARPANET at that point) where a routing table got corrupted so it had a negative-length route which routers then propagated to each other, even after the original corrupt router was rebooted. As with AT&T, they had to reboot all the routers at once to get rid of the corruption.
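To make the mechanism concrete, here's a toy distance-vector sketch (Python; the router names, metrics, and full-mesh topology are made up, and this is not the actual ARPANET routing algorithm) showing why a "negative-length" route keeps winning and keeps coming back until every node is reset at once:

    # A bogus "too good to be true" metric always wins the min() comparison,
    # so every neighbor adopts it, and rebooting only the original culprit
    # doesn't help: it immediately relearns the bad route from its peers.
    INF = 999
    NODES = ["r1", "r2", "r3"]          # hypothetical routers, fully meshed
    LINK_COST = 1

    # dist[x] = x's current believed cost to some destination
    dist = {"r1": 2, "r2": 3, "r3": 3}  # sane starting metrics

    def exchange_rounds(k):
        """k rounds of everyone adopting min(own, neighbor's metric + link cost)."""
        for _ in range(k):
            snapshot = dict(dist)
            for x in NODES:
                for y in NODES:
                    if y != x:
                        dist[x] = min(dist[x], snapshot[y] + LINK_COST)

    dist["r1"] = -100                   # corruption: a "negative-length" route
    exchange_rounds(3)
    print("after corruption spreads:", dist)

    dist["r1"] = INF                    # reboot only the original culprit...
    exchange_rounds(3)
    print("after rebooting r1 alone:", dist)   # ...and it relearns the bad route

    for x in NODES:                     # reset everyone at the same time
        dist[x] = INF
    dist["r1"] = 2                      # r1 re-announces its legitimate metric
    exchange_rounds(3)
    print("after simultaneous reset:", dist)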
I can't remember where i read about this, but i recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did i imagine this?
Interesting, thanks! That is different to the story i remember, but it's possible that i remember incorrectly, or read an incorrect explanation.
I believe that i read about this episode in Hans Moravec's book 'Mind Children'. I can see in Google Books that chapter 5 is on 'Wildlife', and there is a section 'Spontaneous Generation', which promises to talk about a "software parasite" which emerged naturally in the ARPAnet - but the bulk of it is not available.
I have spent hours and hours banging my head against Erlang distributed system bugs in production. I am absolutely mystified why anyone thought just using a particular programming language would prevent these scenarios. If it's Turing-complete, expect the unexpected.
The idea isn't that Erlang is infallible in the design of distributed systems.
The idea is it takes away enough foot-guns that if you're banging your head against systems written in it, you'd be banging your head even harder and more often if the same implementor had used another language.
Are you referring to CenturyLink’s 37-hour, nationwide outage?
> In this instance, the malformed packets [Ethernet frames?] included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes.
I think we used to call that a poison pill message (I still bring it up routinely when we talk about load balancing and why infinite retries are a very, very bad idea).
But your queue will grow and grow and the fraction of time you spend servicing old messages grows and grows.
Not a terribly big fan of these queueing systems. People always seem to bung things up in ways they are not quite equipped to fix (in the “you are not smart enough to debug the code you wrote” sense).
Last time I had to help someone with such a situation, we discovered that the duplicate processing problem had existed for >3 months prior to the crisis event, and had been consuming 10% of the system capacity, which was just low enough that nobody noticed.
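The standard mitigation is the one implied above: cap the retries and park the poison pill in a dead-letter queue instead of requeueing it forever. A minimal sketch (plain Python, in-memory queues; the handler, queue names, and retry limit are just placeholders):

    import queue

    MAX_ATTEMPTS = 3              # after this many failures, park it, don't retry

    work_q = queue.Queue()        # normal work
    dead_letter_q = queue.Queue() # poison pills land here for a human to inspect

    def handle(msg):
        # stand-in for real processing; raises on a malformed ("poison") message
        if msg.get("poison"):
            raise ValueError("cannot process message")
        print("processed", msg["id"])

    def worker():
        while not work_q.empty():
            msg = work_q.get()
            try:
                handle(msg)
            except Exception:
                msg["attempts"] = msg.get("attempts", 0) + 1
                if msg["attempts"] >= MAX_ATTEMPTS:
                    dead_letter_q.put(msg)   # park it; do NOT put it back on work_q
                else:
                    work_q.put(msg)          # bounded retry
            finally:
                work_q.task_done()

    work_q.put({"id": 1})
    work_q.put({"id": 2, "poison": True})
    work_q.put({"id": 3})
    worker()
    print("dead-lettered:", dead_letter_q.qsize())

With unbounded retries instead of the MAX_ATTEMPTS check, message 2 would circulate forever and eat a growing share of the worker's time, which is exactly the failure mode described above.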
The thing with feature group D trunks to the long distance network is you could (and still can on non-IP/mobile networks) manually route to another long distance carrier like Verizon, and sidestep the outage from the subscriber end, full stop. That's certainly not possible with any of the contemporary internet outages.
You can inject changes in routing, but if the other carrier doesn't route around the affected network, you're back to square one. That's part of why Level3/CenturyLink was depeered and why several prefixes that are normally announced through it were quickly rerouted by their owners.
That's my point; as a subscriber, you can prefix a long distance call with a routing code to avoid, for example, a shut down long distance network without any administrator changes. Routing to the long distance networks is done independently through the local network, so if AT&T's long distance network was having issues, it'd have no impact on your ability to access Verizon's long distance network.
There's actually no technical reason why you couldn't do that with IP (4 or 6), although you'd need an appropriately located host to be running a relay daemon[0].
0: i.e. something that takes, say, a UDP packet on port NNNN containing a whole raw IPv4 packet, throws away the wrapping, and drops the IPv4 packet onto its own network interface. This is safe - the packet must shrink by a dozen or two bytes with each retransmission, so it can't loop or amplify indefinitely - but such a relay is usually not actually set up anywhere.
Edit: It probably wouldn't work for TCP though - maybe try TOR?
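To make [0] concrete, here's a rough sketch of such a relay (Python raw sockets, needs root; the port is arbitrary and it deliberately skips the "must shrink" check and any validation, so it's illustration only, not something to expose to the internet):

    import socket

    LISTEN_PORT = 9999  # the "port NNNN" from [0]; arbitrary

    # Receive the wrapped packets over ordinary UDP.
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(("0.0.0.0", LISTEN_PORT))

    # Raw socket for re-injecting the inner IPv4 packet (requires root).
    # IP_HDRINCL means the payload we send already contains a full IP header.
    raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
    raw.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)

    while True:
        inner, _src = udp.recvfrom(65535)     # payload is a complete IPv4 packet
        if len(inner) < 20:                   # too short to hold an IPv4 header
            continue
        dst = socket.inet_ntoa(inner[16:20])  # destination address from the header
        raw.sendto(inner, (dst, 0))           # drop it onto our own network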
There are plenty of ways to do what you're describing, and they all work with TCP. Some of them only work if the encapsulated traffic is IPv6 (and are designed to give IPv6 access on ISPs that only support IPv4). Some of them may end up buffering the TCP stream and potentially generating packet boundaries at different locations than in the original TCP stream.
This sounds like the event that is described in the book Masters of Deception: The gang that ruled cyberspace. The way I remember it the book attributes the incident to MoD, while of course still being the result of a bug/faulty design.
BGP is a path-vector routing protocol: every router on the internet is constantly updating its routing tables based on the paths advertised by its peers, generally preferring the shortest AS path to an advertised prefix. When a new route is announced it takes time to propagate through the network and for all routers in the chain to “converge” into a single coherent view.
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
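If you want to see that from the outside, you can ask a public route collector whether a prefix is still being seen behind AS3356. A rough sketch against RIPEstat's looking-glass endpoint (the JSON field names here are from memory, so double-check them against the RIPEstat docs; the prefix is a placeholder):

    import json
    import urllib.request

    PREFIX = "198.51.100.0/24"   # placeholder; use one of your own prefixes
    STUCK_AS = "3356"            # Level3/CenturyLink

    url = f"https://stat.ripe.net/data/looking-glass/data.json?resource={PREFIX}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)["data"]

    still_via_3356 = 0
    for rrc in data.get("rrcs", []):          # one entry per route collector
        for peer in rrc.get("peers", []):     # one entry per collector peer
            as_path = peer.get("as_path", "").split()
            if STUCK_AS in as_path:
                still_via_3356 += 1

    print(f"{still_via_3356} collector peers still see {PREFIX} via AS{STUCK_AS}")

If your prefix keeps showing up with 3356 in the path long after you've shut your sessions down, that's the "announcements not being pulled" problem described above.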
BGP operates as a rumor mill. Convergence is the process of all of the rumors settling into a steady state. The rumors are of the form "I can reach this range of IP addresses by going through this path of networks." Networks will refuse to listen to rumors that have themselves in the path, as that would cause traffic to loop.
For each IP range described in the rumor table, each network is free to choose whichever rumor they like best among all they have heard, and send traffic for that range along the described path. Typically this is the shortest, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws it.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: Let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
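Here's a toy simulation of that exact A/B/C/D example (Python; synchronous rounds, shortest loop-free path wins, no timers or policy, so it's only a caricature of real BGP). After the A-B link is cut, you can watch C and D briefly adopt each other's stale paths before the route finally disappears everywhere:

    NEIGHBORS = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B", "D"}, "D": {"B", "C"}}
    ORIGIN = "A"   # A originates the prefix

    # rib[x][n] = path (list of nodes, origin last) that neighbor n last
    # advertised to x, or None if it was withdrawn.
    rib = {x: {} for x in NEIGHBORS}

    def best(x):
        """x's chosen path: empty if x originates, else shortest loop-free path."""
        if x == ORIGIN:
            return []
        cands = [p for p in rib[x].values() if p is not None and x not in p]
        return min(cands, key=len) if cands else None

    def one_round():
        """Everyone advertises its current best path (self prepended) at once."""
        chosen = {x: best(x) for x in NEIGHBORS}
        for x, path in chosen.items():
            for n in NEIGHBORS[x]:
                rib[n][x] = [x] + path if path is not None else None
        return chosen

    for _ in range(4):                   # let the topology converge normally
        one_round()

    # Cut the A-B link: drop the adjacency and whatever was learned over it.
    NEIGHBORS["A"].discard("B"); NEIGHBORS["B"].discard("A")
    rib["B"].pop("A", None);     rib["A"].pop("B", None)

    for r in range(1, 5):                # watch C and D hunt, then give up
        chosen = one_round()
        print(f"round {r}:", {x: chosen[x] for x in "BCD"})

Rounds 1-2 show B withdrawing while C and D flip to the stale paths through each other; by round 3 those are rejected as loops too and the route is gone everywhere.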
IP network routing is distributed systems within distributed systems. For whatever reason the distributed system that is the CenturyLink network isn't "converging" - or, we could say, becoming consistent, or settling - in a timely manner.