
HN was down because the failover server also failed: https://twitter.com/HNStatus/status/1545409429113229312

Double disk failure is improbable but not impossible.

The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.



> Double disk failure is improbable but not impossible.

It's not even improbable if the disks are the same kind purchased at the same time.


I once had a small fleet of SSDs fail because they had some uptime counters that overflowed after 4.5 years, and that somehow persistently wrecked some internal data structures. It turned them into little, unrecoverable bricks.

It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.


You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5 years. I am not even kidding.


Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).


These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032... I can't be precise because they are dead and I am going off of when the hardware configurations were entered into our database (which could have been before they were powered on) or when we handed them over to HN, and when the disks failed.

Unbelievable. Thank you for sharing your experience!


Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.


This morning I googled for issues with the firmware and the model of SSD and got nothing. But now I'm searching for "40000 hours SSD" and getting a million relevant results. Of course, why would I have searched for 40000 hours?

This thread is making me feel a lot less crazy.


I'm hoping that deep in your spam folder is a critical firmware update notice from Dell/EMC/HP/SanDisk from 2 years ago :).


There are times I don't miss dealing with random hardware mystery bullshit.

This one is just ... maddening.


This kind of thing is why I love Hacker News. Someone runs into a strange technical situation, and someone else happens to share their own obscure, related anecdote, which just happens to precisely solve the mystery. Really cool to see it benefit HN itself this time.


It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)


Easy to imagine why this didn’t capture peoples’ attention in late March 2020…


Yes, an enterprisey firmware update - all very boring until BLAM!


Was HN an indirect casualty of Covid?


Interesting how something that is so specifically and unexpectedly devastating, yet was known for such a long time without any serious public-awareness effort from the companies involved, is referred to as a "bug".

It makes you lose data and forces you to purchase new hardware; where I come from, that's usually referred to as "planned" or "convenient" obsolescence.


The difference between planned and convenient seems to be intent. And in this context that difference very much matters. I wouldn’t conflate the two.


Depends on who exactly we are talking about as having the intent...

Both planned and convenient obsolescence are beneficial to device manufacturers. Without proper accountability for that, it only becomes a normal practice.


> Depends on who exactly we are talking about as having the intent...

The manufacturer, obviously. Who else would it be?

Could be an innocent mistake or a deliberate decision. Further action should be predicated on the root cause. Which includes intent.


Popularity is a very poor relevance / truth heuristic.


I wanted to upvote this comment but that just feels wrong.


You're a good man, Charlie Brown.


I wonder if it might be closer to 40,032 hours. The official Dell wording [1] is "after approximately 40,000 hours of usage". 2^57 nanoseconds is 40031.996687737745 hours. Not sure what's special about 57, but a power of 2 limit for a counter makes sense. That time might include some manufacturer testing too.

[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
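For anyone who wants to sanity-check that arithmetic, here's a minimal sketch (assuming the counter really is in nanoseconds; this is just illustrative Python, not anything from the drive firmware):

    # 2^57 nanoseconds expressed in hours and years (illustrative only)
    NS_PER_HOUR = 60 * 60 * 1_000_000_000

    hours = 2**57 / NS_PER_HOUR
    print(f"2^57 ns = {hours:.9f} hours")                # ~40031.996687738 hours
    print(f"        = {hours / 24 / 365.25:.2f} years")  # ~4.57 years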


It might not be nanoseconds, but a tick that's a power-of-2 number of nanoseconds going into an appropriately small container seems likely. For example, a 62.5 MHz counter hitting 53 bits breaks at the same limit. Why 53 bits? That's where things start to get weird with IEEE doubles: adding 1 no longer fits into the mantissa and the number doesn't change. So maybe someone was doing a bit of FP math to figure out the time or schedule the next event? Anyway, very likely some kind of clock math wrapped or saturated and broke a fundamental assumption.
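To make the 53-bit point concrete, here's a small sketch (the 62.5 MHz tick rate is just the hypothetical example from above, not a confirmed detail of this firmware):

    # 2^53 is the first integer a 64-bit IEEE double can no longer increment
    big = float(2**53)
    assert big + 1 == big          # adding 1 no longer changes the value

    # A 62.5 MHz tick is 16 ns, so 2^53 ticks = 2^57 ns: the same ~40,032-hour limit
    tick_ns = 1e9 / 62.5e6         # 16.0 ns per tick
    hours = 2**53 * tick_ns / 1e9 / 3600
    print(f"{hours:.3f} hours")    # ~40031.997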


53 is indeed a magic value for IEEE doubles, but why would anybody count an inherently integer value with floating-point? That's a serious rookie mistake.

Of course there's no law that says SSD firmware writers can't be rookies.


Full stack JS, everything is a double down to the SSD firmware!


See! People should register via email for those important notifications! (Or alternatively, do quarterly checks that your firmware is up to date.)


A lot of companies have teams dedicated to hardware that don’t give a shit about it. And their managers don’t give a shit.

Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc for updates, or to apply firmware updates, because “separation of duties”.

Basically, IT is cancer from the head down.



Do they use SSDs on space missions as well?


Only for 4 years, 206 days and 16 hours.


Is this leased to HN as dedicated/bare-metal servers, or is it colocation, i.e. HN owns the hardware?


The former.


It's concerning that a hosting company was unaware of the 40,000-hour situation with the SSDs it was deploying. Anyone in hosting would have been made aware of this, or at least should have kept a better grip on happenings in the market.


Yeah, this is why you run all equipment in a test environment for 4.5 years before deploying it to prod. Really basic stuff.


The HD makers started issuing warnings in 2020... this was foreseeable


How many other customers will/have hit this?


Every large DC will have hit it (Amazon, Facebook, Google, etc). But it's a shame that all their operational knowledge is kept secret.


I understand Backblaze is more HDD than SSD, but perhaps they might have some level of awareness.


I had a similar issue, but it was a single RAID-5 array and wear or some other manufacturing defect. They were the same brand, model, and batch. When the first one failed and the array went into recovery mode, I ordered 3 replacements and upped the backup frequency. It was good that I did, because the two remaining drives died shortly after.

The lesson learned: the three replacements went to different arrays, and we never again let drives from the same batch be part of the same array.


There's a principle in aviation of staggering engine maintenance on multiple-engined airplanes to avoid maintenance-induced errors leading to complete power loss.

e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf


Yep: if you buy a pair of disks together, there's a fair chance they'll both be from the same manufacturing batch, which correlates with disk defects.


Yeah, just came here to say this. Multiple disk failures are pretty probable. I've had batches of both HDDs and SSDs with sequential serial numbers, subjected to the same workloads, all fail within the same ~24-hour periods.


Had the same experience with (identical) SSDs: two failures within 10 minutes in a RAID 5 configuration.

(Thankfully, they didn't completely die but just put themselves into read-only)


Seems like it was only a few days ago that there was a comment from a former Dropbox engineer here pointing out that a lot of disk drives they bought when they stood up their own datacenter had been found to all have a common flaw involving tiny metal slivers.


This makes total sense but I've never heard of it. Is there any literature or writing about this phenomenon?

I guess in some cases proper redundancy also means having different brands of equipment.


I hadn't heard of it either until disks in our storage cluster at work started failing faster than the cluster could rebuild in an event our ops team named SATApocalypse. It was a perfect storm of cascading failures.

https://web.archive.org/web/20220330032426/https://ops.faith...


Great read, thank you!


I also don't know about literature on this phenomenon, but I recall HP had two different SSD recalls because the drives would fail when the uptime counter rolled over. That's not even load dependent; it just depends on whether you got a batch and powered them all on at the same time. Uptime being too high causing issues isn't that unusual for storage, unfortunately.

It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power on time diversity. That adds a lot of variables if you need to track down issues though.

[1] you don't want to have versions with known issues that affect you, but it's helpful to have different versions to diagnose unknown issues.


The Crucial M4 had this too, but it was fixable with a firmware update.

https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...


That one doesn't look too bad; it seems you can fix it with a firmware update after it fails. A lot of disk failures due to firmware bugs end up with the disk not responding on the bus, so it becomes somewhere between impossible and impractical to update the firmware.


I don't know about literature, but in the world of RAID this is a common warning.

Having a RAID5 crash and burn because a second disk failed during the reconstruction phase after the first disk failure is a common story.


Not sure about literature, but that was a known thing in the ops circles I was in 10 years ago: never use the same brand for disk pairs, to keep wear-and-tear-related defects from arising at the same time.


We used to use the same brand, but different models or at least ensure they were from different manufacturing batches.


Wikipedia has a section on this. It's called "correlated failure." https://en.wikipedia.org/wiki/RAID#Correlated_failures


Not sure about literature, but past anecdotes and HN threads yes.

https://news.ycombinator.com/item?id=4989579


This is why I try to mismatch manufacturers in RAID arrays. I'm told there is a small performance hit (things run towards the speed of the slowest drive, separately in terms of latency and throughput), but I doubt the difference is high, and I like the reduction in potential failure-during-rebuild rates. Of course I have off-machine and off-site backups as well as RAID, but having to use them to restore a large array would be a greater inconvenience than just being able to rebuild the array (followed by checksum verification over the whole lot for paranoia's sake).


Eek - now I'm glad I wait a few months before buying each disk for my NAS.

I'm not doing it for this reason but rather for financial ones :) But as I have a totally mixed bunch of sizes, I have no RAID, and a disk loss would be horrible.


You have to be careful doing that too, or you'll end up with subtly different revisions of the same model. This may or may not cause problems depending on the drives/controller/workload, but it can result in you chasing down weird performance gremlins or thinking you have a drive that's going bad.


That's why serious SAN vendors take care to provide you with a mix of disks (e.g. on a brand new NetApp you can see that the disks are of 2-3 different types, and with quite different serial numbers).


Or even if the power supplies were purchased around the same time. I had a batch of servers that, as soon as they arrived, started chewing through hard drives. It took about 10 failed drives before I realized it was a problem with the power supplies.


I learned this principle by getting a ticket for a burnt out headlight 1 week after I replaced the other one.


Anyone familiar with car repair will tell you that if one headlight burns out you should just go ahead and replace both, because of this exact phenomenon. I suppose with LEDs we may not have to worry about it anymore


Even if they're not the same, they're written to at the same time and rate, meaning they have the same wear over time, are subject to the same power/heat issues, etc.


Hopefully, regularly checking the disks' S.M.A.R.T. status will help you stay on top of issues caused by those factors.

Also, you shouldn't wait for disks to fail before replacing them. HN's disks were used for 4.5 years, which is longer than the typical disk lifetime in my experience. They should have replaced them sooner, one by one, in anticipation of failure. This would also have let them stagger their disk purchases to avoid similar manufacturing dates.
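For the power-on-hours angle specifically, even a crude script can flag drives approaching a known-bad threshold. A minimal sketch (assuming Linux with smartmontools installed and SATA drives that expose attribute 9, Power_On_Hours; the device list and threshold are placeholders):

    # Warn when a drive's power-on hours approach a suspect threshold.
    import re
    import subprocess

    SUSPECT_HOURS = 40_000  # e.g. the 40k-hour firmware bug discussed above

    def power_on_hours(dev):
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "Power_On_Hours" in line:
                # raw value is the last column, e.g. "39984" or "39984h+07m"
                raw = line.split()[-1]
                return int(re.match(r"\d+", raw).group())
        return None

    for dev in ("/dev/sda", "/dev/sdb"):
        hours = power_on_hours(dev)
        if hours is not None and hours >= SUSPECT_HOURS - 1000:
            print(f"{dev}: {hours}h power-on time; check firmware advisories")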



I've seen too many dead disks with a perfect SMART. When the numbers go down (or up) and triggers are fired, then you surely need to replace the disk [0], but SMART without warnings just means nothing.

[0] My desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART; on a couple I needed to move the starting point of the partition a couple of GBs further from sector 0 (otherwise they would stall pretty soon), but overall they worked fine. Still, I never used them as reliable storage and I knew I could lose them at any time.

Of course I don't use repurposed drives in the servers.

PS: when I tried to post this, I received "We're having some trouble serving your request. Sorry!" Sheesh.


> Double disk failure is improbable but not impossible.

It's actually surprisingly common for failover hardware to fail shortly after the primary hardware. It has normally been exposed to conditions similar to what killed the primary, and the strain of failing over pushes it over the edge.


Isn't that more for load balancing than failover?

For load balancing I would consider this very likely, because both are equally loaded. But "failover" I would usually consider a scenario where a second server is purely waiting for the primary to fail, in which case it would be virtually unused. Like an active/passive scenario, as someone mentioned below.

But perhaps I got my terminology mixed up. I'm not working with servers so much anymore.


If it's active/active failover then they get the same wear, if it's active/passive most of the components don't, but the storage might. Then again if it's active/passive, flaws can "hibernate" and get triggered exactly when failing over.

You know how they say to always test your backups? Always test your failover too.


>the failover server also failed

Those responsible for the sacking have also been sacked.


According to this comment: https://news.ycombinator.com/item?id=32024485

each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.

On the other hand the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.


> so it seems we're talking about 4 drives failing, not just 2.

Yes—I'm a bit unclear on what happened there, but that does seem to be the case.


If you have an active/passive HA setup and don't test it periodically (by taking the active server offline and switching roles afterwards), my guess is that double disk failures will be more common for you than single disk failures.

Still, I see no reason for prioritizing that failure mode on a site like HN.


Depends on your vendor as well.

A long time ago we had a Dell server that came with RAID preconfigured by Dell (don't ask, I didn't order it). Eventually one disk in this server died; what sucked was that the second disk in the RAID array also failed only a few minutes later. We had to restore from backup, which sucked, but to our surprise, when we opened the Dell server, the two disks had sequential serial numbers. They came from the same batch at the same time. Not a good thing to do when you sell people preconfigured RAID systems at a markup...


By second disk failure do they mean that the disks on both the primary and fallback servers failed? Or do they mean that two disks (of a RAID1 or similar setup) in the fallback server failed?

The latter is understandable; the former would be quite a surprise for such a popular site. It would mean the machines have no disk redundancy and a server goes down immediately on disk failure, with the fallback server as the only backup.


14 hours ago HN failed over to the standby due to a disk failure on the primary. 8 hours ago the standby's disk also failed.

Primary failure: https://news.ycombinator.com/item?id=32024036

Standby failure: https://twitter.com/HNStatus/status/1545409429113229312


The disks on both the primary and fallback servers definitely failed. Each was in a RAID setup, but those failed too in both cases.


Ouch! I'm assuming the disks were from the same batch and installed at the same time, but having at least four fail like that is just crazy unlucky.


Veteran programmer and HN user kabdib has proposed a striking theory that could explain everything: https://news.ycombinator.com/item?id=32028511.


What was the test to determine the data loss?


Informal. My last upvote was pretty close to when HN went down, so I expected my karma to go down, but it didn't.

Also I remember the "Why we're going with Rails" story on the front page from before it went down.


I came to the same conclusion by observing that there are posts and comments from only eight hours ago.


So that means data loss... probably restored from backup.

Good news for people who were banned, or for posts that didn't get enough momentum :)

edit: It was restored from backup... so, definite data loss.


8 hours of downtime, but no data loss, since there was no data to lose during the downtime.

Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565

First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)

So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.

There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.

I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.


Curiosity got the better of me. Why was there a 6 ID gap between the last post and first post? The answer seems to be that admins were making posts, which is neat. (There was also one lonely Flexport job ad.)

Is your backup system tied to your API? Algolia is a third party service, and streaming the latest HN data to Algolia seems pretty similar to streaming it to a backup system.


I posted a bunch of test things and then deleted them.


I love this answer so much.


I really wanted to ask “How did you post things if the server was down?” but perhaps some things are better left as mysteries.


You could see them via HN’s API before they were deleted, nothing interesting; API was back up before the www.


Good observation. Posting something and then seeing it show up in the API was one of the things we were testing. It exercises a lot of the code.


The server was up for us before it was up for everybody else.


I got that Flexport ad too... haha, kinda alarming if they are the only YC company still hiring.


Btw, job ads get queued long in advance and then the software picks the next one when it's time for a job ad. After 8 hours of being down, the software thought it was time for a job ad.


> So that means data loss... probably restored from backup.

If the server went down at XX:XX, and the backup they restored from is also from XX:XX, there isn't data loss. If the server was down for 8 hours, the last data being 8 hours old isn't data loss; it's correct.


I'm extremely curious about the makes & models of the failed hardware...


> Double disk failure is improbable but not impossible.

Were they connected on the same power supply? I had 4 different disks fail at the same time before, but they were all in the same PC... (lightning)


They were in two mirrors, each mirror in a different server. Each server in different racks in the same row. The servers were on different power circuits from different panels.


[deleted]


CP => !A


Is dang pushing changes and such on his own?

Sounds like it is run by one guy.


I push changes on my own all the time, but the work of getting HN running again today was overwhelmingly done by my brilliant colleague mthurman.


What makes you think that? That's just a tweet from an unrelated account.


Never mind, I thought the OP ran that Twitter account.


HN will be around for a hundred years. I think it's more than just a forum. We've seen lots of people coordinate during disasters, for example. Dan and his team do a good job running it. (I'm not a part of it.)

EDIT: My response was based on some edits that are now removed.


You are overestimating HN way too much.


A hundred years? I give it 10, tops.


It’s already been around since 2007. How many decades does HN need to be around before people realize it’s an institution?


The reason it's an institution is that it hasn't been bought by some corp trying to squeeze value out of eyeballs, which is why it hasn't really changed much.

However, it takes money and time to keep it around in a not-for-profit way, so it will be an institution only as long as its funding stays the same.


Yeah, I really hope that if Y Combinator ever wants to pull out, they don't sell it but let the community pull together to support it. I'd gladly donate to keep it running as it is.

It would be even better if they just keep doing it as they are though <3


Considering the hardware it's running on, it wouldn't cost much to keep it running.


More the labor, I'd assume: development to continue scaling, maintenance, and moderation.


Slashdot has been around since 1997 and people still rave about its moderation system today. However, while I have high hopes for HN, it could very well go the way of Digg overnight.


I doubt that, though. Digg was hyped way too much, and the inevitable decline that comes after the hype killed it. Some things are good enough to survive that phase, but Digg wasn't. HN never had a hype phase, just slow but strong and steady growth. And not growing too much either.

It seems the perfect circumstances to really last. It doesn't have an invasive business model, or investors screaming for ROI either. That's the kind of thing that often leads to user-hostile changes that so often start the decline into oblivion.

Also, I would imagine it's pretty cheap to host; after all, it's all very simple text. I don't think it hosts any pictures besides the little Y Combinator logo in the corner :)


Change of ownership is inevitable, as people don't live forever. When that happens, if the new owners aren't interested or motivated to keep funding HN, it could easily go the way of the dodo.

Hopefully archive.org is involved in archiving HN, though unfortunately archive.org's future itself is in jeopardy.


> though unfortunately archive.org's future itself is in jeopardy.

How so?? This is the first I've heard of it.


It's truly tragic how few people have heard of this. We should be doing all we can to raise awareness of this lawsuit before it's too late.

Here are some relevant links:

https://news.ycombinator.com/item?id=31703394

https://decrypt.co/31906/activists-rally-save-internet-archi...

https://www.courtlistener.com/docket/17211300/hachette-book-...


Thanks! I heard of this back in the day, but I didn't know it had reached the courts. I feel this is a bit self-inflicted though. A valuable service like this is already under scrutiny over copyright, and playing chicken with big publishers with tons of money to spend on lawyers is a really bad idea. I'm also opposed to the way copyright works, but I would separate that fight from the service.

I guess it got them some goodwill during Corona but it could cause more damage than it's worth.

I wouldn't have done it; it's not like it provided real value during the pandemic. Those who are really into books and don't care about copyright already know their way to more gray-area sites like LibGen.


It would be super ironic if Reddit acquired it somehow.


There are two people full-time on it, but dang appears to be both DBA and SRE.


And Mod; hope he gets three cheques



