Checkpoint/restart in practice: When 'simple is better'

Nosayba El-Sayed, Bianca Schroeder

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

11 Citations (Scopus)

Abstract

Efficient use of high-performance computing (HPC) installations critically relies on effective methods for fault tolerance. The most commonly used method is checkpoint/restart, where an application writes periodic checkpoints of its state to stable storage that it can restart from in the case of a failure. Despite the prevalence of checkpoint/restart, it is still not very well understood in practice how to set its key parameter, the checkpoint interval. Despite a large body of theoretical work, practitioners still rely on crude rules of thumb such as 'checkpoint once every hour'. Our goal is to identify methods for optimizing the checkpointing process that are easy to use in practice and at the same time achieve high-quality solutions. In particular, our paper makes the following contributions: We evaluate an array of methods for optimizing the checkpoint interval, some previously known as well as new ones that we propose, using real-world failure logs. We show that a very simple closed-form solution can easily be adapted for use in practice and achieves near-optimal performance. We also find that more complex solutions only negligibly improve performance based on real-world traces. We show that simple back-of-the-envelope formulas can be used to accurately estimate the wasted work in HPC systems, and make projections of future HPC systems and requirements for their efficient use.
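
The abstract does not say which closed-form solution or which back-of-the-envelope formulas the paper uses. Purely as an illustration of what such formulas can look like, the sketch below assumes Young's first-order approximation, a well-known closed form from the checkpointing literature: interval = sqrt(2 * C * M), where C is the cost of writing one checkpoint and M is the mean time between failures. The wasted-work estimate C/T + T/(2M) is the matching first-order approximation. The function names and the example numbers (a 5-minute checkpoint, a 24-hour MTBF) are hypothetical and are not taken from the paper.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Checkpoint interval from Young's first-order formula: sqrt(2 * C * M).

    Assumption: this is one well-known closed form, not necessarily the
    one evaluated in the paper.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the fraction of time lost.

    C / T       : overhead of writing one checkpoint per interval
    T / (2 * M) : expected rework after a failure (half an interval on average)
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost = 300.0        # hypothetical: 5 minutes to write a checkpoint
    mtbf = 24 * 3600.0  # hypothetical: one failure per day on average

    best = young_interval(cost, mtbf)
    print(f"closed-form interval: {best / 3600:.1f} h")
    print(f"waste at closed-form interval: {wasted_fraction(best, cost, mtbf):.1%}")
    print(f"waste when checkpointing hourly: {wasted_fraction(3600.0, cost, mtbf):.1%}")
```

With these hypothetical inputs the closed form gives a two-hour interval, and the first-order estimate puts the wasted fraction below that of the 'checkpoint once every hour' rule of thumb. Both formulas assume the checkpoint cost is small relative to the MTBF.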

Original language: English
Title of host publication: 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 84-92
Number of pages: 9
ISBN (Electronic): 9781479955480
DOIs: https://doi.org/10.1109/CLUSTER.2014.6968777
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: 16th IEEE International Conference on Cluster Computing, CLUSTER 2014 - Madrid, Spain
Duration: 22 Sep 2014 to 26 Sep 2014

Other

Other: 16th IEEE International Conference on Cluster Computing, CLUSTER 2014
Country: Spain
City: Madrid
Period: 22/9/14 to 26/9/14

Keywords

  • Checkpoint-restart
  • Fault tolerance
  • High-performance computing

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

El-Sayed, N., & Schroeder, B. (2014). Checkpoint/restart in practice: When 'simple is better'. In 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014 (pp. 84-92). [6968777] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CLUSTER.2014.6968777