Self-refined fault tolerance in HPC using dynamic dependent process groups

N. P. Gopalan, Nagarajan Kathiresan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

This paper proposes a novel method for achieving distributed self-refined fault tolerance by dynamically partitioning the processes into smaller groups that are mutually disjoint and collectively exhaustive of the whole system. The model tolerates frequent faults and makes rollback recovery simpler and less time-consuming. An optimal checkpoint interval is found using a mathematical approximation, and a spare process captures all in-transit messages when a process fails at its ends. Piggybacking the events of dependent processes on outgoing messages is used for process grouping. A process with maximum information can scatter chunk values to the other dependent processes in its group. Each process constructs a checkpoint when the received chunk matches its log.
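The abstract does not spell out the grouping rule or the approximation used for the checkpoint interval, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that two processes are "dependent" when they have exchanged a message, builds the disjoint groups with a union-find structure, and substitutes Young's well-known first-order approximation for the checkpoint interval. The names `dependent_groups` and `young_interval` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def dependent_groups(processes, messages):
    """Partition processes into disjoint groups of mutually dependent processes.

    Assumption: a (sender, receiver) message makes the two processes dependent.
    The groups returned are mutually disjoint and together cover every process.
    """
    parent = {p: p for p in processes}

    def find(p):
        # Walk to the group representative, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for sender, receiver in messages:
        a, b = find(sender), find(receiver)
        if a != b:
            parent[b] = a  # merge the two dependency groups

    groups = defaultdict(set)
    for p in processes:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values())


def young_interval(checkpoint_cost, mtbf):
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF).

    Assumption: stands in for the unspecified approximation in the abstract.
    Both arguments must use the same time unit.
    """
    return sqrt(2 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Six processes; messages link 0-1-2 and 3-4, process 5 stays alone.
    print(dependent_groups(range(6), [(0, 1), (1, 2), (3, 4)]))
    print(young_interval(2.0, 100.0))
```

Under these assumptions, a failure only forces rollback within the failed process's own group, which is one way to read the abstract's claim that recovery becomes simpler and faster.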

Original language: English
Title of host publication: Distributed Computing - IWDC 2005 - 7th International Workshop, Proceedings
Pages: 153-158
Number of pages: 6
Volume: 3741 LNCS
Publication status: Published - 2005
Externally published: Yes
Event: 7th International Workshop on Distributed Computing, IWDC 2005 - Kharagpur, India
Duration: 27 Dec 2005 to 30 Dec 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3741 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Workshop on Distributed Computing, IWDC 2005
Country: India
City: Kharagpur
Period: 27/12/05 to 30/12/05

Fingerprint

Fault tolerance
Recovery
Dependent
Checkpoint
Scatter
Grouping
Tolerance
Partitioning
Disjoint
Fault
Interval
Approximation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Gopalan, N. P., & Kathiresan, N. (2005). Self-refined fault tolerance in HPC using dynamic dependent process groups. In Distributed Computing - IWDC 2005 - 7th International Workshop, Proceedings (Vol. 3741 LNCS, pp. 153-158). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3741 LNCS).

@inproceedings{0d682b00bff04de7b114f5c756459fb2,
title = "Self-refined fault tolerance in HPC using dynamic dependent process groups",
abstract = "This paper proposes a novel method for achieving a distributed self-refined fault tolerance by dynamically partitioning the processes into smaller groups, which are mutually disjoint and collectively exhaustive of the whole system. The present model provides tolerance for frequent faults, makes the roll back recovery simple and less time consuming. An optimal checkpoint interval is found using a mathematical approximation and a spare process is made to capture all the in-transit messages when a process fails at its ends. Piggybacking the events of dependent processes on the outgoing messages is used for process grouping. A process with maximum information can scatter chunk values to the other dependent processes in its group. Each process constructs a checkpoint when the received chunk matches with its log.",
author = "Gopalan, N. P. and Kathiresan, Nagarajan",
year = "2005",
language = "English",
isbn = "3540309594",
volume = "3741 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "153--158",
booktitle = "Distributed Computing - IWDC 2005 - 7th International Workshop, Proceedings",
}

TY  - GEN
T1  - Self-refined fault tolerance in HPC using dynamic dependent process groups
AU  - Gopalan, N. P.
AU  - Kathiresan, Nagarajan
PY  - 2005
Y1  - 2005
N2  - This paper proposes a novel method for achieving a distributed self-refined fault tolerance by dynamically partitioning the processes into smaller groups, which are mutually disjoint and collectively exhaustive of the whole system. The present model provides tolerance for frequent faults, makes the roll back recovery simple and less time consuming. An optimal checkpoint interval is found using a mathematical approximation and a spare process is made to capture all the in-transit messages when a process fails at its ends. Piggybacking the events of dependent processes on the outgoing messages is used for process grouping. A process with maximum information can scatter chunk values to the other dependent processes in its group. Each process constructs a checkpoint when the received chunk matches with its log.
UR  - http://www.scopus.com/inward/record.url?scp=33745305678&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=33745305678&partnerID=8YFLogxK
M3  - Conference contribution
SN  - 3540309594
SN  - 9783540309598
VL  - 3741 LNCS
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 153
EP  - 158
BT  - Distributed Computing - IWDC 2005 - 7th International Workshop, Proceedings
ER  -