Employing checkpoint to improve job scheduling in large-scale systems

Shuangcheng Niu, Jidong Zhai, Xiaosong Ma, Mingliang Liu, Yan Zhai, Wenguang Chen, Weimin Zheng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

The FCFS-based backfill algorithm is widely used for scheduling jobs on high-performance computer systems. The algorithm relies on user-provided runtime estimates for jobs. However, statistics show that these estimates are inaccurate: users are very likely to provide an estimate much longer than the job's real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint-based preemption to address the inaccuracy of user-provided runtime estimates. The approach is evaluated with real workload traces. The results show that, compared with the FCFS-based backfill algorithm, our scheme improves job scheduling performance in waiting time, slowdown, and mean queue length by up to 40%. Meanwhile, only 4% of the jobs need to perform checkpoints.
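
For readers unfamiliar with backfilling, the dependence on runtime estimates mentioned in the abstract can be illustrated with a minimal sketch of the commonly used EASY-style backfill test. This is not the scheme proposed in the paper, and it does not model checkpointing or preemption. The names `Job`, `shadow_time`, and `can_backfill` are hypothetical, and the sketch assumes that running jobs end exactly at their estimated end times.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int        # number of nodes requested
    estimate: float   # user-provided runtime estimate, in seconds


def shadow_time(now, free_nodes, running, head):
    """Return (start, extra) for the job at the head of the FCFS queue.

    `running` is a list of (estimated_end_time, nodes) pairs.
    `start` is the earliest time the head job can begin if every running
    job ends at its estimated end time; `extra` is the number of nodes
    left over once the head job has started.
    """
    avail = free_nodes
    if avail >= head.nodes:
        return now, avail - head.nodes
    # Release nodes in order of estimated completion time.
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            return end, avail - head.nodes
    raise ValueError("head job requests more nodes than the system has")


def can_backfill(now, free_nodes, running, head, cand):
    """Decide whether `cand` may start now without delaying `head`."""
    if cand.nodes > free_nodes:
        return False
    start, extra = shadow_time(now, free_nodes, running, head)
    # Allowed if the candidate is estimated to finish before the head
    # job's reserved start, or if it only uses nodes the head job will
    # not need at that time.
    return now + cand.estimate <= start or cand.nodes <= extra
```

With 2 free nodes, a running 4-node job estimated to end at t=100, and a 6-node head job, a 2-node candidate with a 200-second estimate is refused by this test even if it would actually finish in 40 seconds. That is the kind of loss from overestimated runtimes that the abstract describes.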

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 36-55
Number of pages: 20
Volume: 7698 LNCS
DOI: https://doi.org/10.1007/978-3-642-35867-8-3
Publication status: Published - 24 Jan 2013
Externally published: Yes
Event: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012 - Shanghai, China
Duration: 25 May 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7698 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012
Country: China
City: Shanghai
Period: 25/5/12

Keywords

  • backfill algorithm
  • check-point/restart
  • job scheduling
  • runtime estimate

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Niu, S., Zhai, J., Ma, X., Liu, M., Zhai, Y., Chen, W., & Zheng, W. (2013). Employing checkpoint to improve job scheduling in large-scale systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7698 LNCS, pp. 36-55). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7698 LNCS). https://doi.org/10.1007/978-3-642-35867-8-3