Learning from Failure Across Multiple Clusters

A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations

Nosayba El-Sayed, Hongyu Zhu, Bianca Schroeder

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

In large-scale computing platforms, jobs are prone to interruptions and premature terminations, limiting their usability and leading to significant waste in cluster resources. In this paper, we tackle this problem in three steps. First, we provide a comprehensive study based on log data from multiple large-scale production systems to identify patterns in the behaviour of unsuccessful jobs across different clusters and investigate possible root causes behind job termination. Our results reveal several interesting properties that distinguish unsuccessful jobs from others, particularly w.r.t. resource consumption patterns and job configuration settings. Secondly, we design a machine learning-based framework for predicting job and task terminations. We show that job failures can be predicted relatively early with high precision and recall, and also identify attributes that have strong predictive power of job failure. Finally, we demonstrate in a concrete use case how our prediction framework can be used to mitigate the effect of unsuccessful execution using an effective task-cloning policy that we propose.
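
The abstract reports prediction quality in terms of precision and recall. As a reminder of what those two metrics measure for a failure predictor, here is a minimal sketch. The function name and the eight job labels are hypothetical and chosen purely for illustration; they are not taken from the paper or its traces.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels where True means 'job failed'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # failures correctly flagged
    fp = sum(1 for a, p in pairs if not a and p)    # successful jobs flagged as failures
    fn = sum(1 for a, p in pairs if a and not p)    # failures the predictor missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical outcomes for eight jobs (illustrative only, not data from the paper).
actual = [True, True, True, False, False, False, True, False]
predicted = [True, True, False, False, True, False, True, False]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision means few successful jobs are wrongly flagged (so little effort is wasted on unnecessary mitigation), while high recall means few failing jobs go undetected.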

Original language: English
Title of host publication: Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1333-1344
Number of pages: 12
ISBN (Electronic): 9781538617915
DOI: https://doi.org/10.1109/ICDCS.2017.317
Publication status: Published - 13 Jul 2017
Externally published: Yes
Event: 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017 - Atlanta, United States
Duration: 5 Jun 2017 - 8 Jun 2017

Other

Other: 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017
Country: United States
City: Atlanta
Period: 5/6/17 - 8/6/17

Fingerprint

Cloning
Learning systems

Keywords

  • Failure Mitigation
  • Failure Prediction
  • Job Failure
  • Large-Scale Systems
  • Reliability
  • Trace Analysis

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

El-Sayed, N., Zhu, H., & Schroeder, B. (2017). Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations. In Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017 (pp. 1333-1344). [7980073] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDCS.2017.317

@inproceedings{37a22b24f4a04bfb911c5e5dc0a44ce7,
title = "Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations",
abstract = "In large-scale computing platforms, jobs are prone to interruptions and premature terminations, limiting their usability and leading to significant waste in cluster resources. In this paper, we tackle this problem in three steps. First, we provide a comprehensive study based on log data from multiple large-scale production systems to identify patterns in the behaviour of unsuccessful jobs across different clusters and investigate possible root causes behind job termination. Our results reveal several interesting properties that distinguish unsuccessful jobs from others, particularly w.r.t. resource consumption patterns and job configuration settings. Secondly, we design a machine learning-based framework for predicting job and task terminations. We show that job failures can be predicted relatively early with high precision and recall, and also identify attributes that have strong predictive power of job failure. Finally, we demonstrate in a concrete use case how our prediction framework can be used to mitigate the effect of unsuccessful execution using an effective task-cloning policy that we propose.",
keywords = "Failure Mitigation, Failure Prediction, Job Failure, Large-Scale Systems, Reliability, Trace Analysis",
author = "Nosayba El-Sayed and Hongyu Zhu and Bianca Schroeder",
year = "2017",
month = "7",
day = "13",
doi = "10.1109/ICDCS.2017.317",
language = "English",
pages = "1333--1344",
booktitle = "Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Learning from Failure Across Multiple Clusters

T2 - A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations

AU - El-Sayed, Nosayba

AU - Zhu, Hongyu

AU - Schroeder, Bianca

PY - 2017/7/13

Y1 - 2017/7/13

AB - In large-scale computing platforms, jobs are prone to interruptions and premature terminations, limiting their usability and leading to significant waste in cluster resources. In this paper, we tackle this problem in three steps. First, we provide a comprehensive study based on log data from multiple large-scale production systems to identify patterns in the behaviour of unsuccessful jobs across different clusters and investigate possible root causes behind job termination. Our results reveal several interesting properties that distinguish unsuccessful jobs from others, particularly w.r.t. resource consumption patterns and job configuration settings. Secondly, we design a machine learning-based framework for predicting job and task terminations. We show that job failures can be predicted relatively early with high precision and recall, and also identify attributes that have strong predictive power of job failure. Finally, we demonstrate in a concrete use case how our prediction framework can be used to mitigate the effect of unsuccessful execution using an effective task-cloning policy that we propose.

KW - Failure Mitigation

KW - Failure Prediction

KW - Job Failure

KW - Large-Scale Systems

KW - Reliability

KW - Trace Analysis

UR - http://www.scopus.com/inward/record.url?scp=85027266991&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85027266991&partnerID=8YFLogxK

U2 - 10.1109/ICDCS.2017.317

DO - 10.1109/ICDCS.2017.317

M3 - Conference contribution

SP - 1333

EP - 1344

BT - Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -