General question to people who might actually know.
Is there anywhere that publishes anything like Backblaze's Hard Drive Failure Rates [1], but for GPU failure rates in environments like data centers, high-performance computing, supercomputers, and mainframes?
The best that came up in a search was a semi-modern article from 2023 [2] that appears to be a one-off and is mostly about consumer-facing GPU purchases rather than bulk data center, constant-usage conditions. It's difficult to really believe some of these hardware depreciation numbers, since there appears to be so little information beyond guesstimates.
On further searching, I found an arXiv paper from UIUC (Urbana-Champaign) about a 1,056-GPU A100 and H100 system. [3] However, the paper is primarily about memory issues and per-job downtime that causes task failures and lost work. GPU resilience is discussed, but mostly from the perspective of short-term robustness in the face of propagating memory-corruption issues and error correction, rather than multi-year, 100%-utilization GPU burnout rates.
Any info on longer-term burnout / failure rates for GPUs, similar to Backblaze?
Edit: This article [4] claims a 0.1-2% failure rate per year (0.8% estimated average), with no real information about where the data came from (it cites "industry reports and data center statistics"), and then claims GPUs often last 3-5 years on average.
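For context, the Backblaze-style number I'm hoping someone publishes for GPUs is an annualized failure rate (failures per device-year). Here's a minimal sketch of that arithmetic in Python; the fleet size, observation window, and failure count are entirely made up to illustrate the calculation (the 1,056 figure just mirrors the UIUC system size), not real data:

    # Backblaze-style annualized failure rate (AFR): failures per device-year,
    # expressed as a percentage. All numbers below are hypothetical.

    def annualized_failure_rate(failures: int, device_days: float) -> float:
        """AFR = failures / device-years, as a percentage."""
        device_years = device_days / 365.0
        return 100.0 * failures / device_years

    # Hypothetical fleet: 1,056 GPUs observed for 2 years, 18 confirmed failures.
    gpus = 1056
    days_observed = 2 * 365
    failures = 18

    afr = annualized_failure_rate(failures, gpus * days_observed)
    print(f"AFR: {afr:.2f}% per year")  # ~0.85% with these made-up numbers

That is the kind of fleet-level data (device-days plus confirmed failures per model) I'm looking for, rather than a bare "0.1-2%" figure with no stated denominator.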
[1] https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-...
[2] https://www.tweaktown.com/news/93052/heres-look-at-gpu-failu...
[3] https://arxiv.org/html/2503.11901v3
[4] https://cyfuture.cloud/kb/gpu/what-is-the-failure-rate-of-th...