RAFT at work

Speeding-up mapreduce applications under task and node failures

Jorge Arnulfo Quiane Ruiz, Christoph Pinkel, Jörg Schad, Jens Dittrich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

The MapReduce framework is typically deployed on very large computing clusters where task and node failures are no longer an exception but the rule. Thus, fault-tolerance is an important aspect for the efficient operation of MapReduce jobs. However, currently MapReduce implementations fully recompute failed tasks (subparts of a job) from the beginning. This can significantly decrease the runtime performance of MapReduce applications. We present an alternative system that implements RAFT ideas. RAFT is a family of powerful and inexpensive Recovery Algorithms for Fast-Tracking MapReduce jobs under task and node failures. To recover from task failures, RAFT exploits the intermediate results persisted by MapReduce at several points in time. RAFT piggybacks checkpoints on the task progress computation. To recover from node failures, RAFT maintains a per-map task list of all input key-value pairs producing intermediate results and pushes intermediate results to reducers. In this demo, we demonstrate that RAFT recovers efficiently from both task and node failures. Further, the audience can compare RAFT with Hadoop via an easy-to-use web interface.
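The task-failure idea in the abstract (resume from persisted intermediate results instead of recomputing from the beginning) can be illustrated with a toy sketch. This is not the RAFT or Hadoop implementation: the names `Checkpoint`, `run_map_task`, `word_pairs`, `spill_every`, and `fail_at` are hypothetical, persistence is simulated with an in-memory list, and the checkpoint is assumed to advance only when buffered output is spilled.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: input position plus output persisted so far."""
    next_offset: int = 0
    persisted: list = field(default_factory=list)


def word_pairs(line):
    """Toy map function: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def run_map_task(records, map_fn, checkpoint, spill_every=3, fail_at=None):
    """Run a map task starting at checkpoint.next_offset.

    Output is buffered and 'spilled' (persisted) after every `spill_every`-th
    input record. The checkpoint advances only at a spill, so output that was
    buffered but not spilled is lost on failure and must be recomputed.
    `fail_at` injects a simulated failure just before that record index.
    Returns the number of records processed in this attempt.
    """
    buffer = []
    processed = 0
    for offset in range(checkpoint.next_offset, len(records)):
        if offset == fail_at:
            raise RuntimeError(f"simulated task failure at record {offset}")
        buffer.extend(map_fn(records[offset]))
        processed += 1
        if (offset + 1) % spill_every == 0:
            # Piggyback the checkpoint on the spill: persist, then advance.
            checkpoint.persisted.extend(buffer)
            buffer.clear()
            checkpoint.next_offset = offset + 1
    # Final spill for whatever is still buffered at the end of the input.
    checkpoint.persisted.extend(buffer)
    checkpoint.next_offset = len(records)
    return processed


records = ["a b", "b c", "c d", "d e", "e f", "f g", "g h"]
ckpt = Checkpoint()
try:
    run_map_task(records, word_pairs, ckpt, fail_at=5)  # first attempt fails
except RuntimeError:
    pass
# The retry resumes at the last checkpoint (record 3), not at record 0.
reprocessed = run_map_task(records, word_pairs, ckpt)
print(f"records processed after recovery: {reprocessed} of {len(records)}")
```

In this sketch the retry processes 4 of the 7 records, because records 0 to 2 were already spilled before the failure. A full recomputation, the behavior the abstract attributes to existing implementations, would process all 7.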

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Pages: 1225-1227
Number of pages: 3
DOIs: https://doi.org/10.1145/1989323.1989460
Publication status: Published - 11 Jul 2011
Externally published: Yes
Event: 2011 ACM SIGMOD and 30th PODS 2011 Conference - Athens, Greece
Duration: 12 Jun 2011 to 16 Jun 2011

Other

Other: 2011 ACM SIGMOD and 30th PODS 2011 Conference
Country: Greece
City: Athens
Period: 12/6/11 to 16/6/11

Fingerprint

Cluster computing
Fault tolerance
Recovery

Keywords

  • checkpointing
  • fault-tolerance
  • hadoop
  • mapreduce
  • node failures
  • recovery

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Quiane Ruiz, J. A., Pinkel, C., Schad, J., & Dittrich, J. (2011). RAFT at work: Speeding-up mapreduce applications under task and node failures. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1225-1227) https://doi.org/10.1145/1989323.1989460

RAFT at work: Speeding-up mapreduce applications under task and node failures. / Quiane Ruiz, Jorge Arnulfo; Pinkel, Christoph; Schad, Jörg; Dittrich, Jens.

Proceedings of the ACM SIGMOD International Conference on Management of Data. 2011. p. 1225-1227.


Quiane Ruiz, JA, Pinkel, C, Schad, J & Dittrich, J 2011, RAFT at work: Speeding-up mapreduce applications under task and node failures. in Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 1225-1227, 2011 ACM SIGMOD and 30th PODS 2011 Conference, Athens, Greece, 12/6/11. https://doi.org/10.1145/1989323.1989460
Quiane Ruiz JA, Pinkel C, Schad J, Dittrich J. RAFT at work: Speeding-up mapreduce applications under task and node failures. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2011. p. 1225-1227 https://doi.org/10.1145/1989323.1989460
Quiane Ruiz, Jorge Arnulfo; Pinkel, Christoph; Schad, Jörg; Dittrich, Jens. / RAFT at work: Speeding-up mapreduce applications under task and node failures. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2011. pp. 1225-1227
@inproceedings{7c110c7ef4cf496a92e5096ec6a8c2e3,
title = "RAFT at work: Speeding-up mapreduce applications under task and node failures",
abstract = "The MapReduce framework is typically deployed on very large computing clusters where task and node failures are no longer an exception but the rule. Thus, fault-tolerance is an important aspect for the efficient operation of MapReduce jobs. However, currently MapReduce implementations fully recompute failed tasks (subparts of a job) from the beginning. This can significantly decrease the runtime performance of MapReduce applications. We present an alternative system that implements RAFT ideas. RAFT is a family of powerful and inexpensive Recovery Algorithms for Fast-Tracking MapReduce jobs under task and node failures. To recover from task failures, RAFT exploits the intermediate results persisted by MapReduce at several points in time. RAFT piggybacks checkpoints on the task progress computation. To recover from node failures, RAFT maintains a per-map task list of all input key-value pairs producing intermediate results and pushes intermediate results to reducers. In this demo, we demonstrate that RAFT recovers efficiently from both task and node failures. Further, the audience can compare RAFT with Hadoop via an easy-to-use web interface.",
keywords = "checkpointing, fault-tolerance, hadoop, mapreduce, node failures, recovery",
author = "{Quiane Ruiz}, {Jorge Arnulfo} and Christoph Pinkel and J{\"o}rg Schad and Jens Dittrich",
year = "2011",
month = "7",
day = "11",
doi = "10.1145/1989323.1989460",
language = "English",
isbn = "9781450306614",
pages = "1225--1227",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - RAFT at work

T2 - Speeding-up mapreduce applications under task and node failures

AU - Quiane Ruiz, Jorge Arnulfo

AU - Pinkel, Christoph

AU - Schad, Jörg

AU - Dittrich, Jens

PY - 2011/7/11

Y1 - 2011/7/11

N2 - The MapReduce framework is typically deployed on very large computing clusters where task and node failures are no longer an exception but the rule. Thus, fault-tolerance is an important aspect for the efficient operation of MapReduce jobs. However, currently MapReduce implementations fully recompute failed tasks (subparts of a job) from the beginning. This can significantly decrease the runtime performance of MapReduce applications. We present an alternative system that implements RAFT ideas. RAFT is a family of powerful and inexpensive Recovery Algorithms for Fast-Tracking MapReduce jobs under task and node failures. To recover from task failures, RAFT exploits the intermediate results persisted by MapReduce at several points in time. RAFT piggybacks checkpoints on the task progress computation. To recover from node failures, RAFT maintains a per-map task list of all input key-value pairs producing intermediate results and pushes intermediate results to reducers. In this demo, we demonstrate that RAFT recovers efficiently from both task and node failures. Further, the audience can compare RAFT with Hadoop via an easy-to-use web interface.

AB - The MapReduce framework is typically deployed on very large computing clusters where task and node failures are no longer an exception but the rule. Thus, fault-tolerance is an important aspect for the efficient operation of MapReduce jobs. However, currently MapReduce implementations fully recompute failed tasks (subparts of a job) from the beginning. This can significantly decrease the runtime performance of MapReduce applications. We present an alternative system that implements RAFT ideas. RAFT is a family of powerful and inexpensive Recovery Algorithms for Fast-Tracking MapReduce jobs under task and node failures. To recover from task failures, RAFT exploits the intermediate results persisted by MapReduce at several points in time. RAFT piggybacks checkpoints on the task progress computation. To recover from node failures, RAFT maintains a per-map task list of all input key-value pairs producing intermediate results and pushes intermediate results to reducers. In this demo, we demonstrate that RAFT recovers efficiently from both task and node failures. Further, the audience can compare RAFT with Hadoop via an easy-to-use web interface.

KW - checkpointing

KW - fault-tolerance

KW - hadoop

KW - mapreduce

KW - node failures

KW - recovery

UR - http://www.scopus.com/inward/record.url?scp=79960006168&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79960006168&partnerID=8YFLogxK

U2 - 10.1145/1989323.1989460

DO - 10.1145/1989323.1989460

M3 - Conference contribution

SN - 9781450306614

SP - 1225

EP - 1227

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

ER -