p.14:

"Because 99.999% availability provides for less than 5 minutes of downtime per year, every operation performed to the system in production will need to be automated and tested with the same level of care. With a 5 minute per year budget, human judgement and action is completely off the table for failure recovery"



That just gets you back to automated systems causing their own failures, then responding to those failures by causing more of them.

For example, a relatively common occurrence is a link carrying a BGP session getting saturated. Get the saturation bad enough and the BGP session goes down, which redirects all of that traffic onto another link with its own BGP session, which then proceeds to go down as well. Meanwhile the original session comes back up, takes the traffic back, and saturates again. The failures synchronize and take down a third link, each round carrying more traffic and therefore causing the next failure faster.
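Here's a toy model of that cascade, just link capacities and even traffic splitting, nothing BGP-specific, so treat it as illustrative only:

    # Toy cascade model: traffic is split evenly across the links that are
    # up. Any link pushed past its capacity drops its session, and the loop
    # redistributes the load over the survivors.
    def cascade(capacities, total_traffic, failed=()):
        up = set(range(len(capacities))) - set(failed)
        while up:
            share = total_traffic / len(up)
            overloaded = {i for i in up if share > capacities[i]}
            if not overloaded:
                return up          # stable: every survivor carries its share
            up -= overloaded       # those sessions drop; go around again
        return up                  # empty: the whole group cascaded down

    # Three links of capacity 10 carry 24 units comfortably (8 each), but
    # once one session flaps out, the survivors are asked for 12 each:
    print(cascade([10, 10, 10], 24))              # {0, 1, 2} -- healthy
    print(cascade([10, 10, 10], 24, failed=[0]))  # set() -- total cascade

Note it doesn't model sessions coming back up, which is what makes the real thing oscillate instead of just collapsing once.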

The second issue discussed is that the availability math only works if the failures never synchronize, i.e. if they are statistically independent. That assumption underlies a lot of statistical analyses, and mostly people just don't care that it's violated. Yes, that makes those analyses wrong. But we don't have a better way of doing them.
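To make "wrong" concrete, take two redundant links that are each down 0.1% of the time; the independence math and the fully-synchronized worst case differ by three orders of magnitude:

    p = 1e-3   # each link down 0.1% of the time

    # Independence assumption: both down at once with probability p*p
    print(p * p)   # 1e-06 -- the redundancy "buys" three extra nines

    # Fully synchronized failures: both down whenever either is
    print(p)       # 0.001 -- the redundancy buys nothing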



