http://www.youtube.com/watch?v=p2FWMO2QonY&feature=dir
ABSTRACT
Component failure in large-scale IT installations is becoming an ever larger problem as the number of processors, memory chips, and disks in a single cluster approaches a million. Yet virtually no data on failures in real systems is publicly available, forcing researchers to base their work on anecdotes and back-of-the-envelope calculations. In this talk, we will present results from our analysis of failure data from 26 large-scale production systems at three different organizations, including two high-performance computing sites and one large internet service provider.