On-the-fly recovery of job input data in supercomputers

Chao Wang, Zhe Zhang, Sudharshan S. Vazhkudai, Xiaosong Ma, Frank Mueller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking, while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.

Original language: English
Title of host publication: Proceedings of the International Conference on Parallel Processing
Pages: 620-627
Number of pages: 8
DOI: https://doi.org/10.1109/ICPP.2008.28
Publication status: Published - 17 Nov 2008
Externally published: Yes
Event: 37th International Conference on Parallel Processing, ICPP 2008 - Portland, OR, United States
Duration: 9 Sep 2008 to 12 Sep 2008

Fingerprint

Supercomputers
Parallel File System
Recovery
Locking
Incentre
Turnaround time
Storage System
Metadata
Workload
Immediately
Update
Simulation Study
Computing
Demonstrate
Experiments
Framework

ASJC Scopus subject areas

  • Software
  • Mathematics(all)
  • Hardware and Architecture

Cite this

Wang, C., Zhang, Z., Vazhkudai, S. S., Ma, X., & Mueller, F. (2008). On-the-fly recovery of job input data in supercomputers. In Proceedings of the International Conference on Parallel Processing (pp. 620-627). [4625901] https://doi.org/10.1109/ICPP.2008.28

@inproceedings{2fca16ce6add4df1bdfb23f9e85c3100,
title = "On-the-fly recovery of job input data in supercomputers",
abstract = "Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking, while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.",
author = "Chao Wang and Zhe Zhang and Vazhkudai, {Sudharshan S.} and Xiaosong Ma and Frank Mueller",
year = "2008",
month = "11",
day = "17",
doi = "10.1109/ICPP.2008.28",
language = "English",
isbn = "9780769533742",
pages = "620--627",
booktitle = "Proceedings of the International Conference on Parallel Processing",

}

TY - GEN

T1 - On-the-fly recovery of job input data in supercomputers

AU - Wang, Chao

AU - Zhang, Zhe

AU - Vazhkudai, Sudharshan S.

AU - Ma, Xiaosong

AU - Mueller, Frank

PY - 2008/11/17

Y1 - 2008/11/17

AB - Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking, while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.

UR - http://www.scopus.com/inward/record.url?scp=55849114447&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=55849114447&partnerID=8YFLogxK

U2 - 10.1109/ICPP.2008.28

DO - 10.1109/ICPP.2008.28

M3 - Conference contribution

AN - SCOPUS:55849114447

SN - 9780769533742

SP - 620

EP - 627

BT - Proceedings of the International Conference on Parallel Processing

ER -