Improving the availability of supercomputer job input data using temporal replication

Chao Wang, Zhe Zhang, Xiaosong Ma, Sudharshan S. Vazhkudai, Frank Mueller

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Storage systems in supercomputers are a major reason for service interruptions. RAID solutions alone cannot provide sufficient protection as 1) growing average disk recovery times make RAID groups increasingly vulnerable to disk failures during reconstruction, and 2) RAID does not help with higher-level faults such as failed I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate "active" job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with real-cluster experiments. Our results show that the scheme allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.
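The coordination between scheduler and file system described in the abstract can be illustrated with a small sketch. This is not the authors' Lustre implementation: the class and method names, the `lead_time` parameter, and the file paths are all hypothetical, and the replica bookkeeping stands in for actual data copies. It only shows the general idea of replicating a job's input files shortly before its predicted start and dropping the replicas once no pending job needs them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Job:
    job_id: str
    input_files: List[str]
    predicted_start: float  # scheduler's estimated start time, in seconds


@dataclass
class TemporalReplicator:
    """Hypothetical bookkeeping for temporary replicas of job input files."""
    lead_time: float  # assumed policy: replicate this many seconds before start
    replicas: Dict[str, Set[str]] = field(default_factory=dict)  # file -> job ids

    def on_schedule_update(self, now: float, queue: List[Job]) -> List[str]:
        """Replicate inputs of jobs expected to start soon.

        Returns the files that were newly replicated by this call.
        """
        created = []
        for job in queue:
            if job.predicted_start - now <= self.lead_time:
                for path in job.input_files:
                    holders = self.replicas.setdefault(path, set())
                    if not holders:
                        # A real system would copy the file's data here.
                        created.append(path)
                    holders.add(job.job_id)
        return created

    def on_job_complete(self, job: Job) -> List[str]:
        """Release this job's claim on its inputs.

        Returns the files whose replicas were removed because no other
        tracked job still needs them.
        """
        removed = []
        for path in job.input_files:
            holders = self.replicas.get(path)
            if holders is None:
                continue
            holders.discard(job.job_id)
            if not holders:
                del self.replicas[path]
                removed.append(path)
        return removed
```

In this sketch a file shared by several queued jobs is replicated once and kept until the last of those jobs completes, which is one way to keep the space overhead limited to "active" data.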

Original language: English
Pages (from-to): 149-157
Number of pages: 9
Journal: Computer Science - Research and Development
Volume: 23
Issue number: 3-4
DOI: 10.1007/s00450-009-0082-8
Publication status: Published - 1 Jun 2009
Externally published: Yes

Fingerprint

Supercomputers
Availability
Computer systems
Bandwidth
Recovery
Experiments

Keywords

  • Batch job scheduler
  • Parallel file system
  • Reliability
  • Supercomputer
  • Temporal replication

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Improving the availability of supercomputer job input data using temporal replication. / Wang, Chao; Zhang, Zhe; Ma, Xiaosong; Vazhkudai, Sudharshan S.; Mueller, Frank.

In: Computer Science - Research and Development, Vol. 23, No. 3-4, 01.06.2009, p. 149-157.

@article{1e1a1b721d48451bac8485eafb169a97,
title = "Improving the availability of supercomputer job input data using temporal replication",
abstract = "Storage systems in supercomputers are a major reason for service interruptions. RAID solutions alone cannot provide sufficient protection as 1) growing average disk recovery times make RAID groups increasingly vulnerable to disk failures during reconstruction, and 2) RAID does not help with higher-level faults such as failed I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate {"}active{"} job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with real-cluster experiments. Our results show that the scheme allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.",
keywords = "Batch job scheduler, Parallel file system, Reliability, Supercomputer, Temporal replication",
author = "Chao Wang and Zhe Zhang and Xiaosong Ma and Vazhkudai, {Sudharshan S.} and Frank Mueller",
year = "2009",
month = "6",
day = "1",
doi = "10.1007/s00450-009-0082-8",
language = "English",
volume = "23",
pages = "149--157",
journal = "Computer Science - Research and Development",
issn = "1865-2034",
publisher = "Springer Verlag",
number = "3-4",

}

TY - JOUR

T1 - Improving the availability of supercomputer job input data using temporal replication

AU - Wang, Chao

AU - Zhang, Zhe

AU - Ma, Xiaosong

AU - Vazhkudai, Sudharshan S.

AU - Mueller, Frank

PY - 2009/6/1

Y1 - 2009/6/1

AB - Storage systems in supercomputers are a major reason for service interruptions. RAID solutions alone cannot provide sufficient protection as 1) growing average disk recovery times make RAID groups increasingly vulnerable to disk failures during reconstruction, and 2) RAID does not help with higher-level faults such as failed I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate "active" job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with real-cluster experiments. Our results show that the scheme allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.

KW - Batch job scheduler

KW - Parallel file system

KW - Reliability

KW - Supercomputer

KW - Temporal replication

UR - http://www.scopus.com/inward/record.url?scp=67349212880&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67349212880&partnerID=8YFLogxK

U2 - 10.1007/s00450-009-0082-8

DO - 10.1007/s00450-009-0082-8

M3 - Article

VL - 23

SP - 149

EP - 157

JO - Computer Science - Research and Development

JF - Computer Science - Research and Development

SN - 1865-2034

IS - 3-4

ER -