Recovering transient data: Automated on-demand data reconstruction and offloading for supercomputers

Sudharshan Vazhkudai, Xiaosong Ma

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

It has become a national priority to build and use PetaFlop supercomputers. The dependability of such large systems has been recognized as a key issue that can impact their usability. Even with smaller, existing machines, failures are the norm rather than an exception. Research has shown that storage systems are the primary source of faults leading to supercomputer unavailability. In this paper, we envision two mechanisms, namely on-demand data reconstruction and eager data offloading, to address the availability of job input/output data. These two techniques aim to allow parallel jobs and post-job processing tools to continue execution despite storage system failures in supercomputers. Fundamental to both approaches is the definition and acquisition of recovery-related parallel file system metadata, which is then coupled with transparent remote data accesses. Our approach attempts to maximize the utilization of precious supercomputer resources by improving the accessibility of transient job data. Further, the proposed methods are best-effort in nature and complement existing file system recovery schemes, which are designed for persistent data. Several of our previous studies help in demonstrating the feasibility of the proposed approaches.

Original language: English
Title of host publication: Operating Systems Review (ACM)
Pages: 14-18
Number of pages: 5
Volume: 41
Edition: 1
DOI: 10.1145/1228291.1228297
Publication status: Published - 1 Jan 2007
Externally published: Yes

Fingerprint

Supercomputers
Recovery
Metadata
Availability
Processing

Keywords

  • Data reconstruction
  • File system recovery
  • Supercomputer availability

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems

Cite this

Recovering transient data: Automated on-demand data reconstruction and offloading for supercomputers. / Vazhkudai, Sudharshan; Ma, Xiaosong.

Operating Systems Review (ACM). Vol. 41 1. ed. 2007. p. 14-18.

Vazhkudai, Sudharshan; Ma, Xiaosong. / Recovering transient data: Automated on-demand data reconstruction and offloading for supercomputers. Operating Systems Review (ACM). Vol. 41 1. ed. 2007. pp. 14-18.
@inproceedings{37d851037a924b6290dcfd4578fed14f,
title = "Recovering transient data: Automated on-demand data reconstruction and offloading for supercomputers",
abstract = "It has become a national priority to build and use PetaFlop supercomputers. The dependability of such large systems has been recognized as a key issue that can impact their usability. Even with smaller, existing machines, failures are the norm rather than an exception. Research has shown that storage systems are the primary source of faults leading to supercomputer unavailability. In this paper, we envision two mechanisms, namely on-demand data reconstruction and eager data offloading, to address the availability of job input/output data. These two techniques aim to allow parallel jobs and post-job processing tools to continue execution despite storage system failures in supercomputers. Fundamental to both approaches is the definition and acquisition of recovery-related parallel file system metadata, which is then coupled with transparent remote data accesses. Our approach attempts to maximize the utilization of precious supercomputer resources by improving the accessibility of transient job data. Further, the proposed methods are best-effort in nature and complement existing file system recovery schemes, which are designed for persistent data. Several of our previous studies help in demonstrating the feasibility of the proposed approaches.",
keywords = "Data reconstruction, File system recovery, Supercomputer availability",
author = "Sudharshan Vazhkudai and Xiaosong Ma",
year = "2007",
month = "1",
day = "1",
doi = "10.1145/1228291.1228297",
language = "English",
volume = "41",
pages = "14--18",
booktitle = "Operating Systems Review (ACM)",
edition = "1",

}

TY - GEN

T1 - Recovering transient data

T2 - Automated on-demand data reconstruction and offloading for supercomputers

AU - Vazhkudai, Sudharshan

AU - Ma, Xiaosong

PY - 2007/1/1

Y1 - 2007/1/1

N2 - It has become a national priority to build and use PetaFlop supercomputers. The dependability of such large systems has been recognized as a key issue that can impact their usability. Even with smaller, existing machines, failures are the norm rather than an exception. Research has shown that storage systems are the primary source of faults leading to supercomputer unavailability. In this paper, we envision two mechanisms, namely on-demand data reconstruction and eager data offloading, to address the availability of job input/output data. These two techniques aim to allow parallel jobs and post-job processing tools to continue execution despite storage system failures in supercomputers. Fundamental to both approaches is the definition and acquisition of recovery-related parallel file system metadata, which is then coupled with transparent remote data accesses. Our approach attempts to maximize the utilization of precious supercomputer resources by improving the accessibility of transient job data. Further, the proposed methods are best-effort in nature and complement existing file system recovery schemes, which are designed for persistent data. Several of our previous studies help in demonstrating the feasibility of the proposed approaches.

KW - Data reconstruction

KW - File system recovery

KW - Supercomputer availability

UR - http://www.scopus.com/inward/record.url?scp=70149084445&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70149084445&partnerID=8YFLogxK

U2 - 10.1145/1228291.1228297

DO - 10.1145/1228291.1228297

M3 - Conference contribution

AN - SCOPUS:70149084445

VL - 41

SP - 14

EP - 18

BT - Operating Systems Review (ACM)

ER -