Optimizing center performance through coordinated data staging, scheduling and recovery

Zhe Zhang, Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Gregory G. Pike, John W. Cobb, Frank Mueller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Citations (Scopus)

Abstract

Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve centerwide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center's standpoint, these techniques optimize resource usage and increase its data/service availability. From a user's standpoint, they reduce the job turnaround time and optimize the allocated time usage. (c) 2007 ACM.

Original language: English
Title of host publication: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07
DOIs: https://doi.org/10.1145/1362622.1362696
Publication status: Published - 1 Dec 2007
Externally published: Yes
Event: 2007 ACM/IEEE Conference on Supercomputing, SC'07 - Reno, NV, United States
Duration: 10 Nov 2007 → 16 Nov 2007

Other

Other: 2007 ACM/IEEE Conference on Supercomputing, SC'07
Country: United States
City: Reno, NV
Period: 10/11/07 → 16/11/07

Fingerprint

Scheduling
Availability
Recovery
Turnaround time
Supercomputers

Keywords

  • Coordinated scheduling
  • Data scheduling
  • Data staging
  • HPC center performance optimization
  • Transient data recovery

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
  • Electrical and Electronic Engineering

Cite this

Zhang, Z., Wang, C., Vazhkudai, S. S., Ma, X., Pike, G. G., Cobb, J. W., & Mueller, F. (2007). Optimizing center performance through coordinated data staging, scheduling and recovery. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07 [55] https://doi.org/10.1145/1362622.1362696

@inproceedings{0da7ddf865694ae6b63a675948154553,
title = "Optimizing center performance through coordinated data staging, scheduling and recovery",
abstract = "Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve centerwide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center's standpoint, these techniques optimize resource usage and increase its data/service availability. From a user's standpoint, they reduce the job turnaround time and optimize the allocated time usage. (c) 2007 ACM.",
keywords = "Coordinated scheduling, Data scheduling, Data staging, HPC center performance optimization, Transient data recovery",
author = "Zhe Zhang and Chao Wang and Vazhkudai, {Sudharshan S.} and Xiaosong Ma and Pike, {Gregory G.} and Cobb, {John W.} and Frank Mueller",
year = "2007",
month = "12",
day = "1",
doi = "10.1145/1362622.1362696",
language = "English",
isbn = "9781595937643",
booktitle = "Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07",
}

TY - GEN
T1 - Optimizing center performance through coordinated data staging, scheduling and recovery
AU - Zhang, Zhe
AU - Wang, Chao
AU - Vazhkudai, Sudharshan S.
AU - Ma, Xiaosong
AU - Pike, Gregory G.
AU - Cobb, John W.
AU - Mueller, Frank
PY - 2007/12/1
Y1 - 2007/12/1
AB - Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve centerwide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center's standpoint, these techniques optimize resource usage and increase its data/service availability. From a user's standpoint, they reduce the job turnaround time and optimize the allocated time usage. (c) 2007 ACM.
KW - Coordinated scheduling
KW - Data scheduling
KW - Data staging
KW - HPC center performance optimization
KW - Transient data recovery
UR - http://www.scopus.com/inward/record.url?scp=56749179540&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=56749179540&partnerID=8YFLogxK
U2 - 10.1145/1362622.1362696
DO - 10.1145/1362622.1362696
M3 - Conference contribution
SN - 9781595937643
BT - Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07
ER -