RAFTing MapReduce: Fast recovery on the RAFT

Jorge Arnulfo Quiane Ruiz, Christoph Pinkel, Jorg Schad, Jens Dittrich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

41 Citations (Scopus)


MapReduce is a computing paradigm that has gained a lot of popularity as it allows non-expert users to easily run complex analytical tasks at very large scale. At such scale, task and node failures are no longer an exception but rather a characteristic of large-scale systems. This makes fault tolerance a critical issue for the efficient operation of any application. MapReduce automatically reschedules failed tasks to available nodes, which in turn recompute such tasks from scratch. However, this policy can significantly decrease the performance of applications. In this paper, we propose a family of Recovery Algorithms for Fast-Tracking (RAFT) MapReduce. As ease of use is a major feature of MapReduce, RAFT focuses on simplicity and non-intrusiveness, in order to be implementation-independent. To efficiently recover from task failures, RAFT exploits the fact that MapReduce produces and persists intermediate results at several points in time. RAFT piggy-backs checkpoints on the task progress computation. To deal with multiple node failures, we propose query metadata checkpointing. We keep track of the mapping between input key-value pairs and intermediate data for all reduce tasks. Thereby, RAFT does not need to re-execute completed map tasks entirely. Instead, RAFT only recomputes intermediate data that were processed for local reduce tasks and hence not shipped to another node for processing. We also introduce a scheduling strategy that takes full advantage of these recovery algorithms. We implemented RAFT on top of Hadoop and evaluated it on a 45-node cluster using three common analytical tasks. Overall, our experimental results demonstrate that RAFT outperforms Hadoop runtimes by 23% on average under task and node failures. The results also show that RAFT has negligible runtime overhead.
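The task-recovery idea in the abstract (record a checkpoint whenever intermediate results are persisted, then resume from it instead of recomputing from scratch) can be illustrated with a toy sketch. This is not the authors' implementation and does not use any Hadoop API: `Checkpoint`, `run_map_task`, `spill_every`, and `fail_at` are hypothetical names introduced here, and the "persisted" state is just an in-memory object standing in for data on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical stand-in for checkpoint state persisted on local disk."""
    offset: int = 0                               # next input record to read
    spilled: list = field(default_factory=list)   # intermediate pairs already persisted


def run_map_task(records, map_fn, ckpt, spill_every=3, fail_at=None):
    """Run (or resume) a toy map task from ckpt.offset.

    Every `spill_every` records the in-memory buffer is "spilled" and the
    checkpoint is advanced in the same step, so the checkpoint piggy-backs
    on a write that happens anyway. Returns the number of records this
    attempt processed.
    """
    buffer = []
    processed = 0
    for i in range(ckpt.offset, len(records)):
        if fail_at is not None and i == fail_at:
            # Simulated failure: the unspilled buffer is lost, the checkpoint survives.
            raise RuntimeError("simulated task failure")
        buffer.extend(map_fn(records[i]))
        processed += 1
        if processed % spill_every == 0:
            ckpt.spilled.extend(buffer)
            ckpt.offset = i + 1
            buffer = []
    # Final spill at task completion.
    ckpt.spilled.extend(buffer)
    ckpt.offset = len(records)
    return processed


def word_pairs(line):
    """Toy map function: emit (word, 1) for each word in the line."""
    return [(w, 1) for w in line.split()]


records = ["a b", "b c", "c d", "d e", "e f", "f g", "g h"]
ckpt = Checkpoint()
try:
    run_map_task(records, word_pairs, ckpt, fail_at=5)   # first attempt fails at record 5
except RuntimeError:
    pass
redone = run_map_task(records, word_pairs, ckpt)         # second attempt resumes at record 3
```

In this sketch the second attempt restarts at the last checkpointed offset, so it reprocesses only the records after the last spill rather than the whole input. The node-failure case described in the abstract (query metadata checkpointing) is not modeled here.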

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Number of pages: 12
Publication status: Published - 6 Jun 2011
Externally published: Yes
Event: 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011 - Hannover, Germany
Duration: 11 Apr 2011 to 16 Apr 2011


Other: 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011


ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Quiane Ruiz, J. A., Pinkel, C., Schad, J., & Dittrich, J. (2011). RAFTing MapReduce: Fast recovery on the RAFT. In Proceedings - International Conference on Data Engineering (pp. 589-600). [5767877] https://doi.org/10.1109/ICDE.2011.5767877