Reading between the lines of failure logs: Understanding how HPC systems fail

Nosayba El-Sayed, Bianca Schroeder

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

40 Citations (Scopus)

Abstract

As the component count in supercomputing installations continues to increase, system reliability is becoming one of the major issues in designing HPC systems. These issues will become more challenging in future Exascale systems, which are predicted to include millions of CPU cores. Even with relatively reliable individual components, the sheer number of components will increase failure rates to unprecedented levels. Efficiently running those systems will require a good understanding of how different factors impact system reliability. In this paper we use a decade's worth of field data made available by Los Alamos National Lab to study the impact of a diverse set of factors on the reliability of HPC systems. We provide insights into the nature of correlations between failures, and investigate the impact of factors such as power quality, temperature, fan and chiller reliability, system usage and utilization, and external factors such as cosmic radiation, on system reliability.

Original language: English
Title of host publication: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2013
DOIs: https://doi.org/10.1109/DSN.2013.6575356
Publication status: Published - 9 Sep 2013
Externally published: Yes
Event: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2013 - Budapest, Hungary
Duration: 24 Jun 2013 to 27 Jun 2013

Other

Other: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2013
Country: Hungary
City: Budapest
Period: 24/6/13 to 27/6/13

Fingerprint

Cosmic rays
Power quality
Fans
Program processors
Temperature

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

El-Sayed, N., & Schroeder, B. (2013). Reading between the lines of failure logs: Understanding how HPC systems fail. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2013 [6575356] https://doi.org/10.1109/DSN.2013.6575356

@inproceedings{473e8dafc455461fb24bbddb711ffba2,
title = "Reading between the lines of failure logs: Understanding how HPC systems fail",
author = "Nosayba El-Sayed and Bianca Schroeder",
year = "2013",
month = "9",
day = "9",
doi = "10.1109/DSN.2013.6575356",
language = "English",
isbn = "9781467364713",
booktitle = "2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2013",

}

TY - GEN

T1 - Reading between the lines of failure logs

T2 - Understanding how HPC systems fail

AU - El-Sayed, Nosayba

AU - Schroeder, Bianca

PY - 2013/9/9

Y1 - 2013/9/9

UR - http://www.scopus.com/inward/record.url?scp=84883367588&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883367588&partnerID=8YFLogxK

U2 - 10.1109/DSN.2013.6575356

DO - 10.1109/DSN.2013.6575356

M3 - Conference contribution

AN - SCOPUS:84883367588

SN - 9781467364713

BT - 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2013

ER -