Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies

Nosayba El-Sayed, Bianca Schroeder

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their respective performance and energy overheads. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies, however, have been traditionally optimized and analyzed from one angle: application performance. In this work, we provide an extensive analysis of the performance, energy and I/O costs associated with a wide array of checkpointing policies. We consider practical deployment issues and show that simple formulas can be used to accurately estimate wasted work in a system. We propose methods to optimize checkpoint scheduling for energy savings and evaluate the runtime-optimized and energy-optimized policies using simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high-quality energy/performance tradeoffs when using methods that exploit characteristics of real-world failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and identify policies that are optimal for I/O savings.

Original language: English
Pages (from-to): 336-350
Number of pages: 15
Journal: IEEE Transactions on Dependable and Secure Computing
Volume: 15
Issue number: 2
DOI: 10.1109/TDSC.2016.2548463
Publication status: Published - 1 Mar 2018

Fingerprint

Cluster computing
Fault tolerance
Scheduling
Energy policy
Energy conservation
Energy utilization
Costs

Keywords

  • checkpoint/restart
  • energy-efficiency
  • fault tolerance
  • High-performance computing
  • i/o subsystem
  • performance

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies. / El-Sayed, Nosayba; Schroeder, Bianca.

In: IEEE Transactions on Dependable and Secure Computing, Vol. 15, No. 2, 01.03.2018, p. 336-350.

@article{c0a075dd451c42c98c2743047be7bbf5,
title = "Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies",
abstract = "As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their respective performance and energy overheads. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies, however, have been traditionally optimized and analysed from one angle: application performance. In this work, we provide an extensive analysis of the performance, energy and I/O costs associated with a wide array of checkpointing policies. We consider practical deployment issues and show that simple formulas can be used to accurately estimate wasted work in a system. We propose methods to optimize checkpoint scheduling for energy savings and evaluate the runtime-optimized and energy-optimized policies using simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high quality energy/performance tradeoffs when using methods that exploit characteristics of real world failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and identify policies that are optimal for I/O savings.",
keywords = "checkpoint/restart, energy-efficiency, fault tolerance, High-performance computing, i/o subsystem, performance",
author = "Nosayba El-Sayed and Bianca Schroeder",
year = "2018",
month = "3",
day = "1",
doi = "10.1109/TDSC.2016.2548463",
language = "English",
volume = "15",
pages = "336--350",
journal = "IEEE Transactions on Dependable and Secure Computing",
issn = "1545-5971",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "2",
}

TY - JOUR

T1 - Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies

AU - El-Sayed, Nosayba

AU - Schroeder, Bianca

PY - 2018/3/1

Y1 - 2018/3/1

AB - As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their respective performance and energy overheads. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies, however, have been traditionally optimized and analysed from one angle: application performance. In this work, we provide an extensive analysis of the performance, energy and I/O costs associated with a wide array of checkpointing policies. We consider practical deployment issues and show that simple formulas can be used to accurately estimate wasted work in a system. We propose methods to optimize checkpoint scheduling for energy savings and evaluate the runtime-optimized and energy-optimized policies using simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high quality energy/performance tradeoffs when using methods that exploit characteristics of real world failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and identify policies that are optimal for I/O savings.

KW - checkpoint/restart

KW - energy-efficiency

KW - fault tolerance

KW - High-performance computing

KW - i/o subsystem

KW - performance

UR - http://www.scopus.com/inward/record.url?scp=85013195288&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013195288&partnerID=8YFLogxK

U2 - 10.1109/TDSC.2016.2548463

DO - 10.1109/TDSC.2016.2548463

M3 - Article

VL - 15

SP - 336

EP - 350

JO - IEEE Transactions on Dependable and Secure Computing

JF - IEEE Transactions on Dependable and Secure Computing

SN - 1545-5971

IS - 2

ER -