Collecting, Analysing, and Exploiting Failure Data from large IT installations

In summary, the presenter discussed the results of their analysis of failure data from 26 large-scale production systems. The data showed that node failure rates were dependent on the type of activity being performed, and not just the level of activity at a node.
  • #1
Ivan Seeking
ABSTRACT
Component failure in large-scale IT installations is becoming an ever larger problem as the number of processors, memory chips, and disks in a single cluster approaches a million. Yet, virtually no data on failures in real systems is publicly available, forcing researchers to base their work on anecdotes and back of the envelope calculations. In this talk, we will present results from our analysis of failure data from 26 large-scale production systems at three different organizations, including two high-performance computing sites and one large internet service provider.
http://www.youtube.com/watch?v=p2FWMO2QonY&feature=dir
 
  • #2
I can't watch YouTube here, but this sounds exactly analogous to machinery component monitoring. I'm surprised no one does that in the IT setting.
 
  • #3
AAARGHH, who let this woman speak? It's like listening to someone drag their fingernails over a chalkboard. She's so nervous about speaking it's painful. I had to shut it off.

Ivan, does she end up making any points? You're a better person than I am if you could sit through almost an hour of this.
 
  • #4
Yes, I thought she made a number of interesting points, but I'm not an IT person, and I have no idea how much might be common knowledge to a pro.
 
  • #5
I didn't watch the video, but I used to run Beowulf clusters (lots of desktop computers wired together into one big computer).
Numbers do come back to bite you: if you have hard drives with an average life of 3 years (about 156 weeks) and a cluster of 150 machines, you can expect to be replacing roughly a disk every week.
In practice, because we were using cheap home machines, we also weren't cooling them properly (large AC is expensive), so we had even more failures than you would expect.

Google used to claim that with their clusters of several thousand machines it wasn't worth even finding a broken machine and turning it off, never mind trying to fix it.

For most real installations you tend to use fewer, higher-powered, better-engineered servers instead of thousands of PCs, since they are easier to monitor and manage. In fact, an increasingly common technique is to use virtual-machine software to run many independent machine images on a single large server.
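The back-of-the-envelope arithmetic above (fleet size divided by mean component lifetime) can be sketched as a quick calculation; the numbers here are the ones from the post, and the even-spread assumption is a simplification:

```python
# Back-of-the-envelope failure-rate arithmetic: with an average drive
# life of 3 years (~156 weeks) and 150 machines, you expect roughly
# one drive replacement per week.

def expected_failures_per_week(fleet_size, mean_life_weeks):
    """Expected replacements per week, assuming failures are spread
    evenly over the mean lifetime (a steady-state simplification)."""
    return fleet_size / mean_life_weeks

rate = expected_failures_per_week(150, 156)
print(round(rate, 2))  # 0.96 -- about one disk a week
```

The same estimate scales linearly: a million-component installation with the same per-component lifetime would see thousands of replacements a week, which is the scaling problem the talk's abstract points at.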
 
  • #6
Evo said:
You're a better person than I am if you could sit through almost an hour of this.

She is a bit like Julia Child with a German accent.

After twenty years of marriage, I can handle anything! :biggrin: :rolleyes:

I thought that one of the more interesting points was that the node failure rate was dependent on the type of activity, and not just the level of activity at a node.
 

FAQ: Collecting, Analysing, and Exploiting Failure Data from large IT installations

What is the importance of collecting failure data from large IT installations?

Collecting failure data from large IT installations is important because it allows organizations to understand the root causes of system failures and make informed decisions to prevent them in the future. It also helps in identifying patterns and trends that can lead to more efficient and effective system maintenance.

How should failure data be collected and stored?

Failure data should be collected and stored in a structured and organized manner to ensure easy retrieval and analysis. This can be done through automated tools and systems that continuously gather data from various sources such as logs, performance metrics, and user feedback.
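As a rough illustration of what "structured and organized" collection might look like, here is a minimal sketch that appends failure events as CSV rows. The field names are hypothetical, not taken from the talk or any standard schema:

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical record schema for a failure event; the fields are
# illustrative assumptions, not a standard.
FIELDS = ["timestamp", "node_id", "component", "failure_type", "downtime_min"]

def record_failure(writer, node_id, component, failure_type, downtime_min):
    """Append one structured failure record with a UTC timestamp."""
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node_id": node_id,
        "component": component,
        "failure_type": failure_type,
        "downtime_min": downtime_min,
    })

# In practice the destination would be a file or database; an
# in-memory buffer keeps the sketch self-contained.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
record_failure(writer, "node-042", "disk", "media_error", 45)
print(buf.getvalue().splitlines()[0])  # the header row
```

Storing every event in one consistent schema is what makes later aggregation (by component, by node, by workload type) straightforward.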

What are the key steps in analyzing failure data?

The key steps in analyzing failure data include identifying the scope and purpose of the analysis, cleaning and preparing the data, performing statistical analysis and visualization, and drawing conclusions and making recommendations based on the findings. It is important to involve subject matter experts and use appropriate analytical techniques for accurate and meaningful insights.
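The cleaning and aggregation steps above can be sketched in a few lines. The sample events and their fields are invented for illustration; the point is only the clean-then-aggregate pattern:

```python
from collections import Counter

# Hypothetical failure log entries: (node_id, component, workload_type).
events = [
    ("n1", "disk", "io-heavy"),
    ("n2", "disk", "io-heavy"),
    ("n1", "memory", "compute"),
    ("n3", "disk", "compute"),
    ("n4", "", "compute"),  # incomplete record
]

# Step 1: clean -- drop records with any missing field.
clean = [e for e in events if all(e)]

# Step 2: aggregate failure counts by component and by workload type.
by_component = Counter(component for _, component, _ in clean)
by_workload = Counter(workload for _, _, workload in clean)

# Step 3: draw a (toy) conclusion from the counts.
print(by_component.most_common(1))  # [('disk', 3)]
```

A real analysis would of course normalize by node counts and exposure time before drawing conclusions, which is where the statistical techniques mentioned above come in.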

How can organizations exploit failure data to improve their IT installations?

Organizations can exploit failure data by using it to identify and prioritize potential system vulnerabilities, proactively address emerging issues, and make informed decisions on system upgrades and changes. It can also be used to improve system performance and reliability, leading to cost savings and increased user satisfaction.

What are the potential challenges in collecting and analyzing failure data?

Some potential challenges in collecting and analyzing failure data include data quality issues, lack of a standardized framework for data collection and analysis, and the need for specialized skills and tools. It is also important to consider data privacy and security concerns when handling sensitive failure data.
