An efficient solution for processing skewed MapReduce jobs

Reza Akbarinia, Miguel Liroz-Gistau, Divyakant Agrawal, Patrick Valduriez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized on several points, in particular for its poor performance in the presence of data skew. There are important cases where a high percentage of the reduce-side processing is done by a few nodes, or even a single node, while the others remain idle. There have been some attempts to address data skew, but only for specific cases. In particular, no solution has been proposed for the cases where most of the intermediate values correspond to a single key, or where the number of keys is smaller than the number of reduce workers. In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and deals efficiently with reduce-side data skew. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which dynamically constructed blocks of intermediate values are processed in parallel by intermediate reduce workers according to a scheduling strategy. With the IR phase, even if all intermediate values belong to a single key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains over native Hadoop, e.g. gains of more than 10 times in reduce time and 5 times in total execution time.
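The idea of the IR phase can be illustrated with a minimal Python sketch. This is not FP-Hadoop's implementation: the function names, the fixed block size, and the use of a thread pool to stand in for intermediate reduce workers are all assumptions made for illustration, and the sketch assumes the reduce function is associative so that partial results can be combined.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")


def split_into_blocks(values: List[T], block_size: int) -> List[List[T]]:
    """Cut the values of one key into blocks (hypothetical fixed-size policy)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def intermediate_reduce(values: List[T], combine: Callable[[T, T], T],
                        block_size: int = 4, workers: int = 4) -> List[T]:
    """Reduce each block independently; threads stand in for IR workers."""
    blocks = split_into_blocks(values, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda block: reduce(combine, block), blocks))


def final_reduce(partials: List[T], combine: Callable[[T, T], T]) -> T:
    """Combine the partial results; assumes at least one partial exists."""
    return reduce(combine, partials)


def add(a: int, b: int) -> int:
    return a + b


# Extreme skew: every intermediate value belongs to the same key.
skewed_values = list(range(1, 101))
partials = intermediate_reduce(skewed_values, add, block_size=10)
total = final_reduce(partials, add)
print(len(partials), total)  # prints "10 5050"
```

Even though all 100 values share one key, the bulk of the work is spread over ten independent block reductions, and only the ten partial results reach the final reducer.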

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 417-429
Number of pages: 13
Volume: 9262
ISBN (Print): 9783319228518
DOIs: https://doi.org/10.1007/978-3-319-22852-5_35
Publication status: Published - 2015
Externally published: Yes
Event: 26th International Conference on Database and Expert Systems Applications, DEXA 2015 - Valencia, Spain
Duration: 1 Sep 2015 to 4 Sep 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9262
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 26th International Conference on Database and Expert Systems Applications, DEXA 2015
Country: Spain
City: Valencia
Period: 1/9/15 to 4/9/15

Fingerprint

MapReduce
Fault tolerance
Efficient Solution
Scalability
Scheduling
Processing
Experiments
Skew
Vertex of a graph
Execution Time
Percentage
Prototype
Resources
Computing

Keywords

  • Data skew
  • Load balancing
  • MapReduce

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Akbarinia, R., Liroz-Gistau, M., Agrawal, D., & Valduriez, P. (2015). An efficient solution for processing skewed MapReduce jobs. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9262, pp. 417-429). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9262). Springer Verlag. https://doi.org/10.1007/978-3-319-22852-5_35

An efficient solution for processing skewed MapReduce jobs. / Akbarinia, Reza; Liroz-Gistau, Miguel; Agrawal, Divyakant; Valduriez, Patrick.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 9262 Springer Verlag, 2015. p. 417-429 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9262).

Akbarinia, R, Liroz-Gistau, M, Agrawal, D & Valduriez, P 2015, An efficient solution for processing skewed MapReduce jobs. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 9262, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9262, Springer Verlag, pp. 417-429, 26th International Conference on Database and Expert Systems Applications, DEXA 2015, Valencia, Spain, 1/9/15. https://doi.org/10.1007/978-3-319-22852-5_35
Akbarinia R, Liroz-Gistau M, Agrawal D, Valduriez P. An efficient solution for processing skewed MapReduce jobs. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 9262. Springer Verlag. 2015. p. 417-429. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-319-22852-5_35
Akbarinia, Reza ; Liroz-Gistau, Miguel ; Agrawal, Divyakant ; Valduriez, Patrick. / An efficient solution for processing skewed MapReduce jobs. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 9262 Springer Verlag, 2015. pp. 417-429 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{549a11e4f3684b058af5442f37612e4b,
title = "An efficient solution for processing skewed MapReduce jobs",
abstract = "Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized in some points, in particular, its poor performance in the case of data skew. There are important cases where a high percentage of processing in the reduce side is done by a few nodes, or even one node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, there is no proposed solution for the cases where most of the intermediate values correspond to a single key, or when the number of keys is less than the number of reduce workers. In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel, and efficiently deals with the problem of data skew in the reduce side. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel, by using a scheduling strategy. By using the IR phase, even if all intermediate values belong to only one key, the main part of the reducing work can be done in parallel by using the computing resources of all available workers. We implemented a prototype of FP-Hadoop, and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.",
keywords = "Data skew, Load balancing, MapReduce",
author = "Reza Akbarinia and Miguel Liroz-Gistau and Divyakant Agrawal and Patrick Valduriez",
year = "2015",
doi = "10.1007/978-3-319-22852-5_35",
language = "English",
isbn = "9783319228518",
volume = "9262",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "417--429",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - An efficient solution for processing skewed MapReduce jobs

AU - Akbarinia, Reza

AU - Liroz-Gistau, Miguel

AU - Agrawal, Divyakant

AU - Valduriez, Patrick

PY - 2015

Y1 - 2015

AB - Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized in some points, in particular, its poor performance in the case of data skew. There are important cases where a high percentage of processing in the reduce side is done by a few nodes, or even one node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, there is no proposed solution for the cases where most of the intermediate values correspond to a single key, or when the number of keys is less than the number of reduce workers. In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel, and efficiently deals with the problem of data skew in the reduce side. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel, by using a scheduling strategy. By using the IR phase, even if all intermediate values belong to only one key, the main part of the reducing work can be done in parallel by using the computing resources of all available workers. We implemented a prototype of FP-Hadoop, and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.

KW - Data skew

KW - Load balancing

KW - MapReduce

UR - http://www.scopus.com/inward/record.url?scp=84943608755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84943608755&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-22852-5_35

DO - 10.1007/978-3-319-22852-5_35

M3 - Conference contribution

SN - 9783319228518

VL - 9262

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 417

EP - 429

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

PB - Springer Verlag

ER -