FP-Hadoop: Efficient processing of skewed MapReduce jobs

Miguel Liroz-Gistau, Reza Akbarinia, Divyakant Agrawal, Patrick Valduriez

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

Nowadays, we are witnessing the fast production of very large amounts of data, particularly by the users of online systems on the Web. However, processing this big data is very challenging, since both the space and the computational requirements are hard to satisfy. One solution for dealing with such requirements is to take advantage of parallel frameworks, such as MapReduce or Spark, which make it possible to build powerful computing and storage units on top of ordinary machines. Although these key-based frameworks have been praised for their high scalability and fault tolerance, they show poor performance in the case of data skew: in important cases, a high percentage of the reduce-side processing ends up being done by a single node. In this paper, we present FP-Hadoop, a Hadoop-based system that makes the reduce side of MapReduce more parallel by efficiently tackling the problem of reduce-side data skew. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which blocks of intermediate values are processed by intermediate reduce workers in parallel. With this approach, even when all intermediate values are associated with the same key, the main part of the reducing work can be performed in parallel, exploiting the computing power of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. Compared to native Hadoop, we achieved excellent performance gains, e.g., more than a 10-fold speedup in reduce time and a 5-fold speedup in total execution time.
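The IR phase described in the abstract relies on the reduce work for one key being splittable into blocks whose partial results can be merged afterwards. The following self-contained Java sketch is illustrative only: it does not use FP-Hadoop's or Hadoop's actual APIs, and all class and method names are hypothetical. It simulates a single hot key whose values are split into fixed-size blocks, reduced in parallel by "intermediate reduce" workers (here, threads), and then merged by a final reduce step.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Minimal sketch of the intermediate reduce (IR) idea.
 * NOT FP-Hadoop code: names and structure are illustrative assumptions.
 *
 * In standard MapReduce, all values of one key go to a single reducer, so a
 * hot key serializes the reduce work on one node. If the reduce function can
 * be applied to partial blocks of values (e.g. a sum), those blocks can be
 * reduced in parallel and their partial results merged by one final reducer.
 */
public class IntermediateReduceSketch {

    // Example reduce function: sum of a block of values.
    static long reduceBlock(List<Long> block) {
        long sum = 0;
        for (long v : block) sum += v;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a skewed key: 10 million intermediate values, all for the same key.
        final int n = 10_000_000;
        List<Long> hotKeyValues = new ArrayList<>(n);
        for (int i = 0; i < n; i++) hotKeyValues.add(1L);

        // Split the value list into fixed-size blocks, as the IR phase would.
        final int blockSize = 500_000;
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Long>> partials = new ArrayList<>();
        for (int start = 0; start < n; start += blockSize) {
            final List<Long> block =
                    hotKeyValues.subList(start, Math.min(start + blockSize, n));
            // Each block is handled by an intermediate reduce worker in parallel.
            partials.add(workers.submit(() -> reduceBlock(block)));
        }

        // Final reduce: merge the partial results (cheap compared to the blocks).
        long total = 0;
        for (Future<Long> p : partials) total += p.get();
        workers.shutdown();

        System.out.println("total = " + total); // expected: 10000000
    }
}

In FP-Hadoop itself, the blocks are built from the intermediate data on the cluster and the intermediate reduce workers are scheduled by the framework across nodes; the sketch only captures the parallel partial-reduce idea on a single machine.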

Original language: English
Pages (from-to): 69-84
Number of pages: 16
Journal: Information Systems
Volume: 60
DOIs: https://doi.org/10.1016/j.is.2016.03.008
Publication status: Published - 1 Aug 2016
Externally published: Yes

Keywords

  • Data skew
  • MapReduce
  • Parallel data processing

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Liroz-Gistau, M., Akbarinia, R., Agrawal, D., & Valduriez, P. (2016). FP-Hadoop: Efficient processing of skewed MapReduce jobs. Information Systems, 60, 69-84. https://doi.org/10.1016/j.is.2016.03.008
