Employing checkpoint to improve job scheduling in large-scale systems

Shuangcheng Niu, Jidong Zhai, Xiaosong Ma, Mingliang Liu, Yan Zhai, Wenguang Chen, Weimin Zheng

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)

Abstract

The FCFS-based backfill algorithm is widely used for scheduling high-performance computer systems. The algorithm relies on runtime estimates of jobs, which are provided by users. However, statistics show that the accuracy of user-provided estimates is poor: users are very likely to provide a runtime estimate much longer than the job's real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint-based preemption to address the inaccuracy of user-provided runtime estimates. The approach is evaluated with real workload traces. The results show that, compared with the FCFS-based backfill algorithm, our scheme improves job scheduling performance in waiting time, slowdown, and mean queue length by up to 40%. Meanwhile, only 4% of the jobs need to perform checkpoints.
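
To illustrate why inflated estimates hurt a conventional backfill scheduler, here is a minimal sketch of the textbook EASY-style admission test. It is not the scheme proposed in the paper; the `Job` fields, the `can_backfill` function, and the `shadow_time`/`extra_nodes` parameters are names assumed for this illustration only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int        # nodes requested
    estimate: float   # user-provided runtime estimate, in hours
    runtime: float    # real execution time, in hours (unknown to the scheduler)


def can_backfill(job: Job, now: float, free_nodes: int,
                 shadow_time: float, extra_nodes: int) -> bool:
    """Return True if `job` may start now without delaying the queue head.

    shadow_time: earliest time the head-of-queue job is expected to start.
    extra_nodes: nodes that remain free even after the head job starts.
    Both are assumed to be computed elsewhere by the scheduler.
    """
    if job.nodes > free_nodes:
        return False
    # The scheduler can only reason about the estimate, never the real runtime.
    ends_before_shadow = now + job.estimate <= shadow_time
    fits_beside_head = job.nodes <= extra_nodes
    return ends_before_shadow or fits_beside_head


# A 1-hour job declared as 5 hours is refused a 2-hour hole it would fit into.
padded = Job("padded", nodes=4, estimate=5.0, runtime=1.0)
exact = Job("exact", nodes=4, estimate=1.0, runtime=1.0)
print(can_backfill(padded, now=0.0, free_nodes=4, shadow_time=2.0, extra_nodes=0))
print(can_backfill(exact, now=0.0, free_nodes=4, shadow_time=2.0, extra_nodes=0))
```

The two jobs are identical except for the declared estimate, yet only the accurately estimated one is admitted. This is the kind of lost opportunity that an aggressive policy backed by checkpoint-based preemption aims to recover.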

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 36-55
Number of pages: 20
Volume: 7698 LNCS
DOIs: https://doi.org/10.1007/978-3-642-35867-8-3
Publication status: Published - 24 Jan 2013
Externally published: Yes
Event: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012 - Shanghai, China
Duration: 25 May 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7698 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012
Country: China
City: Shanghai
Period: 25/5/12

Fingerprint

Checkpoint
Job Scheduling
Large-scale Systems
Large scale systems
Scheduling
Estimate
Preemption
Queue Length
Computer systems
Waiting Time
Statistics
Execution Time
Workload
High Performance
Likely
Trace

Keywords

  • backfill algorithm
  • check-point/restart
  • job scheduling
  • runtime estimate

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Niu, S., Zhai, J., Ma, X., Liu, M., Zhai, Y., Chen, W., & Zheng, W. (2013). Employing checkpoint to improve job scheduling in large-scale systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7698 LNCS, pp. 36-55). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7698 LNCS). https://doi.org/10.1007/978-3-642-35867-8-3

@inproceedings{4fc65cbb9db64937af02a04e1d507822,
title = "Employing checkpoint to improve job scheduling in large-scale systems",
abstract = "The FCFS-based backfill algorithm is widely used in scheduling high-performance computer systems. The algorithm relies on runtime estimate of jobs which is provided by users. However, statistics show the accuracy of user-provided estimate is poor. Users are very likely to provide a much longer runtime estimate than its real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint based preemption to address the inaccuracy in user-provided runtime estimate. The approach is evaluated with real workload traces. The results show that compared with the FCFS-based backfill algorithm, our scheme improves the job scheduling performance in waiting time, slowdown and mean queue length by up to 40{\%}. Meanwhile, only 4{\%} of the jobs need to perform checkpoints.",
keywords = "backfill algorithm, check-point/restart, job scheduling, runtime estimate",
author = "Shuangcheng Niu and Jidong Zhai and Xiaosong Ma and Mingliang Liu and Yan Zhai and Wenguang Chen and Weimin Zheng",
year = "2013",
month = "1",
day = "24",
doi = "10.1007/978-3-642-35867-8-3",
language = "English",
isbn = "9783642358661",
volume = "7698 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "36--55",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
}

TY  - GEN
T1  - Employing checkpoint to improve job scheduling in large-scale systems
AU  - Niu, Shuangcheng
AU  - Zhai, Jidong
AU  - Ma, Xiaosong
AU  - Liu, Mingliang
AU  - Zhai, Yan
AU  - Chen, Wenguang
AU  - Zheng, Weimin
PY  - 2013/1/24
Y1  - 2013/1/24
AB  - The FCFS-based backfill algorithm is widely used in scheduling high-performance computer systems. The algorithm relies on runtime estimate of jobs which is provided by users. However, statistics show the accuracy of user-provided estimate is poor. Users are very likely to provide a much longer runtime estimate than its real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint based preemption to address the inaccuracy in user-provided runtime estimate. The approach is evaluated with real workload traces. The results show that compared with the FCFS-based backfill algorithm, our scheme improves the job scheduling performance in waiting time, slowdown and mean queue length by up to 40%. Meanwhile, only 4% of the jobs need to perform checkpoints.
KW  - backfill algorithm
KW  - check-point/restart
KW  - job scheduling
KW  - runtime estimate
UR  - http://www.scopus.com/inward/record.url?scp=84872531290&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84872531290&partnerID=8YFLogxK
U2  - 10.1007/978-3-642-35867-8-3
DO  - 10.1007/978-3-642-35867-8-3
M3  - Conference contribution
AN  - SCOPUS:84872531290
SN  - 9783642358661
VL  - 7698 LNCS
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 36
EP  - 55
BT  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER  -