Checkpoint/restart in practice: When 'simple is better'

Nosayba El-Sayed, Bianca Schroeder

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

11 Citations (Scopus)

Abstract

Efficient use of high-performance computing (HPC) installations critically relies on effective methods for fault tolerance. The most commonly used method is checkpoint/restart, where an application periodically writes checkpoints of its state to stable storage, from which it can restart after a failure. Although checkpoint/restart is prevalent, how to set its key parameter, the checkpoint interval, is still not well understood in practice. Despite a large body of theoretical work, practitioners still rely on crude rules of thumb such as 'checkpoint once every hour'. Our goal is to identify methods for optimizing the checkpointing process that are easy to use in practice and at the same time achieve high-quality solutions. In particular, our paper makes the following contributions: We evaluate an array of methods for optimizing the checkpoint interval, some previously known as well as new ones that we propose, using real-world failure logs. We show that a very simple closed-form solution can easily be adapted for use in practice and achieves near-optimal performance. We also find that more complex solutions improve performance only negligibly on real-world traces. We show that simple back-of-the-envelope formulas can be used to accurately estimate the wasted work in HPC systems, and we make projections for future HPC systems and the requirements for their efficient use.
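The abstract does not say which closed-form solution or which back-of-the-envelope formulas the paper uses. As an illustration only, the sketch below assumes Young's well-known first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), and a matching first-order estimate of the fraction of time wasted (checkpoint overhead plus, on average, half an interval of lost work per failure; restart cost is ignored). The function names and the example numbers (a 5-minute checkpoint cost, a 24-hour mean time between failures) are hypothetical and not taken from the paper.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Both arguments are in the same time unit (seconds here).
    Assumption: this is one classic closed form, not necessarily
    the one evaluated in the paper.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of time that is wasted.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework, assuming a failure
                                 loses half an interval on average
    Restart cost is deliberately left out of this sketch.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    C = 5 * 60          # hypothetical checkpoint cost: 5 minutes, in seconds
    MTBF = 24 * 3600    # hypothetical mean time between failures: 24 hours

    tau = young_interval(C, MTBF)
    print(f"Young interval: {tau / 3600:.2f} h")
    print(f"waste at Young interval: {wasted_fraction(tau, C, MTBF):.3f}")
    print(f"waste at 'once every hour': {wasted_fraction(3600, C, MTBF):.3f}")
```

Under these assumed numbers the hourly rule of thumb checkpoints more often than the first-order model suggests, so it pays more checkpoint overhead than it saves in rework.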

Original language: English
Title of host publication: 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 84-92
Number of pages: 9
ISBN (Electronic): 9781479955480
DOIs: https://doi.org/10.1109/CLUSTER.2014.6968777
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: 16th IEEE International Conference on Cluster Computing, CLUSTER 2014 - Madrid, Spain
Duration: 22 Sep 2014 to 26 Sep 2014

Other

Other: 16th IEEE International Conference on Cluster Computing, CLUSTER 2014
Country: Spain
City: Madrid
Period: 22/9/14 to 26/9/14

Keywords

  • Checkpoint-restart
  • Fault tolerance
  • High-performance computing

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

El-Sayed, N., & Schroeder, B. (2014). Checkpoint/restart in practice: When 'simple is better'. In 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014 (pp. 84-92). [6968777] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CLUSTER.2014.6968777

@inproceedings{e708edd829834af4b67eb833489e60dc,
title = "Checkpoint/restart in practice: When 'simple is better'",
keywords = "Checkpoint-restart, Fault tolerance, High-performance computing",
author = "Nosayba El-Sayed and Bianca Schroeder",
year = "2014",
month = "1",
day = "1",
doi = "10.1109/CLUSTER.2014.6968777",
language = "English",
pages = "84--92",
booktitle = "2014 IEEE International Conference on Cluster Computing, CLUSTER 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Checkpoint/restart in practice

T2 - When 'simple is better'

AU - El-Sayed, Nosayba

AU - Schroeder, Bianca

PY - 2014/1/1

Y1 - 2014/1/1

KW - Checkpoint-restart

KW - Fault tolerance

KW - High-performance computing

UR - http://www.scopus.com/inward/record.url?scp=84917706033&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84917706033&partnerID=8YFLogxK

U2 - 10.1109/CLUSTER.2014.6968777

DO - 10.1109/CLUSTER.2014.6968777

M3 - Conference contribution

AN - SCOPUS:84917706033

SP - 84

EP - 92

BT - 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014

PB - Institute of Electrical and Electronics Engineers Inc.

ER -