To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing

Nosayba El-Sayed, Bianca Schroeder

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

10 Citations (Scopus)

Abstract

As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Therefore, efficiently running systems at such large scales requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies have traditionally been optimized and analysed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, remains poorly understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead for a given checkpointing policy and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify policies that are better suited to I/O savings, and study how to optimize for energy with a bound on I/O time.
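
For readers unfamiliar with performance-oriented checkpoint scheduling, the following minimal sketch illustrates the classic constant-interval baseline known as Young's first-order approximation. It is not taken from this paper and does not reproduce the paper's energy formulas. The function names and the example values (a 5-minute checkpoint cost and a 24-hour mean time between failures) are assumptions chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the runtime-optimal
    compute time between checkpoints: sqrt(2 * C * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order wasted-time fraction for a constant interval:
    checkpoint cost per interval plus expected rework after a failure."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical inputs (not from the paper): 300 s per checkpoint, 86400 s MTBF.
CHECKPOINT_COST_S = 300.0
MTBF_S = 86400.0

interval = young_interval(CHECKPOINT_COST_S, MTBF_S)
overhead = overhead_fraction(interval, CHECKPOINT_COST_S, MTBF_S)
print(f"interval = {interval:.0f} s, first-order overhead = {overhead:.3f}")
```

Under this simple model, checkpointing either more or less often than the computed interval increases the wasted-time fraction. An energy-oriented policy of the kind the abstract describes would minimize a different objective, so its optimal interval need not coincide with this one.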

Original language: English
Title of host publication: 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 93-102
Number of pages: 10
ISBN (Electronic): 9781479955480
DOIs: https://doi.org/10.1109/CLUSTER.2014.6968778
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: 16th IEEE International Conference on Cluster Computing, CLUSTER 2014 - Madrid, Spain
Duration: 22 Sep 2014 to 26 Sep 2014

Other

Other: 16th IEEE International Conference on Cluster Computing, CLUSTER 2014
Country: Spain
City: Madrid
Period: 22/9/14 to 26/9/14

Fingerprint

Cluster computing
Energy conservation
Scheduling
Fault tolerance
Energy utilization
Costs

Keywords

  • Checkpoint/Restart
  • Energy-efficiency
  • Fault tolerance
  • High-performance computing
  • Performance

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

El-Sayed, N., & Schroeder, B. (2014). To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing. In 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014 (pp. 93-102). [6968778] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CLUSTER.2014.6968778

To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing. / El-Sayed, Nosayba; Schroeder, Bianca.

2014 IEEE International Conference on Cluster Computing, CLUSTER 2014. Institute of Electrical and Electronics Engineers Inc., 2014. p. 93-102 6968778.

El-Sayed, N & Schroeder, B 2014, To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing. in 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014., 6968778, Institute of Electrical and Electronics Engineers Inc., pp. 93-102, 16th IEEE International Conference on Cluster Computing, CLUSTER 2014, Madrid, Spain, 22/9/14. https://doi.org/10.1109/CLUSTER.2014.6968778
El-Sayed N, Schroeder B. To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing. In 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014. Institute of Electrical and Electronics Engineers Inc. 2014. p. 93-102. 6968778 https://doi.org/10.1109/CLUSTER.2014.6968778
El-Sayed, Nosayba; Schroeder, Bianca. / To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing. 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014. Institute of Electrical and Electronics Engineers Inc., 2014. pp. 93-102
@inproceedings{b2569c0b45e6433d93dc9658e82e9e22,
title = "To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing",
abstract = "As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Therefore, efficiently running systems at such large scales requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies have traditionally been optimized and analysed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, remains poorly understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead for a given checkpointing policy and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify policies that are better suited to I/O savings, and study how to optimize for energy with a bound on I/O time.",
keywords = "Checkpoint/Restart, Energy-efficiency, Fault tolerance, High-performance computing, Performance",
author = "Nosayba El-Sayed and Bianca Schroeder",
year = "2014",
month = "1",
day = "1",
doi = "10.1109/CLUSTER.2014.6968778",
language = "English",
pages = "93--102",
booktitle = "2014 IEEE International Conference on Cluster Computing, CLUSTER 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - To checkpoint or not to checkpoint

T2 - Understanding energy-performance-I/O tradeoffs in HPC checkpointing

AU - El-Sayed, Nosayba

AU - Schroeder, Bianca

PY - 2014/1/1

Y1 - 2014/1/1

AB - As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Therefore, efficiently running systems at such large scales requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies have traditionally been optimized and analysed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, remains poorly understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead for a given checkpointing policy and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify policies that are better suited to I/O savings, and study how to optimize for energy with a bound on I/O time.

KW - Checkpoint/Restart

KW - Energy-efficiency

KW - Fault tolerance

KW - High-performance computing

KW - Performance

UR - http://www.scopus.com/inward/record.url?scp=84917696658&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84917696658&partnerID=8YFLogxK

U2 - 10.1109/CLUSTER.2014.6968778

DO - 10.1109/CLUSTER.2014.6968778

M3 - Conference contribution

AN - SCOPUS:84917696658

SP - 93

EP - 102

BT - 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014

PB - Institute of Electrical and Electronics Engineers Inc.

ER -