Employing checkpoint to improve job scheduling in large-scale systems

Shuangcheng Niu, Jidong Zhai, Xiaosong Ma, Mingliang Liu, Yan Zhai, Wenguang Chen, Weimin Zheng

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)

Abstract

The FCFS-based backfill algorithm is widely used for scheduling jobs on high-performance computer systems. The algorithm relies on job runtime estimates provided by users. However, statistics show that the accuracy of user-provided estimates is poor: users are very likely to provide a runtime estimate much longer than the job's real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint-based preemption to address the inaccuracy of user-provided runtime estimates. The approach is evaluated with real workload traces. The results show that, compared with the FCFS-based backfill algorithm, our scheme improves job scheduling performance in waiting time, slowdown, and mean queue length by up to 40%. Meanwhile, only 4% of the jobs need to perform checkpoints.

Original language: English
Title of host publication: Job Scheduling Strategies for Parallel Processing - 16th International Workshop, JSSPP 2012, Revised Selected Papers
Pages: 36-55
Number of pages: 20
Publication status: Published - 24 Jan 2013
Externally published: Yes
Event: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012 - Shanghai, China
Duration: 25 May 2012 to 25 May 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7698 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012
Country: China
City: Shanghai
Period: 25/5/12 to 25/5/12

Keywords

  • backfill algorithm
  • check-point/restart
  • job scheduling
  • runtime estimate

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

  • Cite this

    Niu, S., Zhai, J., Ma, X., Liu, M., Zhai, Y., Chen, W., & Zheng, W. (2013). Employing checkpoint to improve job scheduling in large-scale systems. In Job Scheduling Strategies for Parallel Processing - 16th International Workshop, JSSPP 2012, Revised Selected Papers (pp. 36-55). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7698 LNCS). https://doi.org/10.1007/978-3-642-35867-8-3