p.14:

"Because 99.999% availability provides for less than 5 minutes of downtime per year, every operation performed to the system in production will need to be automated and tested with the same level of care. With a 5 minute per year budget, human judgement and action is completely off the table for failure recovery"



That just gets you back to automated systems causing their own failures, then responding to those failures by causing more of them.

For example, a relatively common occurrence is a link carrying a BGP session getting saturated. Get the saturation bad enough and the BGP session goes down, which redirects all of that traffic onto another link with its own BGP session, which then proceeds to go down as well. Meanwhile the original session comes back up, takes the traffic back, and saturates again. The failures synchronize and take down a third link, each round carrying more traffic and therefore causing the next failure faster.
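Here's a toy model of that cascade, just link capacities and even traffic splitting, nothing BGP-specific, so treat it as illustrative only:

    # Toy cascade model: traffic is split evenly across the links that are
    # up. Any link pushed past its capacity drops its session, and the loop
    # redistributes the load over the survivors.
    def cascade(capacities, total_traffic, failed=()):
        up = set(range(len(capacities))) - set(failed)
        while up:
            share = total_traffic / len(up)
            overloaded = {i for i in up if share > capacities[i]}
            if not overloaded:
                return up          # stable: every survivor carries its share
            up -= overloaded       # those sessions drop; go around again
        return up                  # empty: the whole group cascaded down

    # Three links of capacity 10 carry 24 units comfortably (8 each), but
    # once one session flaps out, the survivors are asked for 12 each:
    print(cascade([10, 10, 10], 24))              # {0, 1, 2} -- healthy
    print(cascade([10, 10, 10], 24, failed=[0]))  # set() -- total cascade

Note it doesn't model sessions coming back up, which is what makes the real thing oscillate instead of just collapsing once.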

The second issue discussed is that the availability math only works if the failures never synchronize, i.e. if they are statistically independent. That assumption underlies a lot of statistical analyses, and mostly people just don't care that it's violated. Yes, that makes those analyses wrong. But we don't have a better way of doing them.
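To make "wrong" concrete, take two redundant links that are each down 0.1% of the time; the independence math and the fully-synchronized worst case differ by three orders of magnitude:

    p = 1e-3   # each link down 0.1% of the time

    # Independence assumption: both down at once with probability p*p
    print(p * p)   # 1e-06 -- the redundancy "buys" three extra nines

    # Fully synchronized failures: both down whenever either is
    print(p)       # 0.001 -- the redundancy buys nothing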



