Data partitioning for minimizing transferred data in MapReduce

Miguel Liroz-Gistau, Reza Akbarinia, Divyakant Agrawal, Esther Pacitti, Patrick Valduriez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Reducing data transfer in MapReduce's shuffle phase is important because it increases the data locality of reduce tasks and thus decreases the overhead of job executions. Several optimizations have been proposed in the literature to reduce data transfer between mappers and reducers. Nevertheless, all these approaches are limited by how intermediate key-value pairs are distributed over map outputs. In this paper, we address the problem of high data transfer in MapReduce and propose a technique that repartitions the tuples of the input datasets, thereby optimizing the distribution of key-value pairs over mappers and increasing the data locality of reduce tasks. Our approach captures the relationships between input tuples and intermediate keys by monitoring the execution of a set of MapReduce jobs that are representative of the workload. Then, based on those relationships, it assigns input tuples to the appropriate chunks. We evaluated our approach through experiments in a Hadoop deployment on top of Grid'5000 using standard benchmarks. The results show a significant reduction in data transfer during the shuffle phase compared to native Hadoop.
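The idea sketched in the abstract (profile which intermediate keys each input tuple produces, then co-locate tuples that emit the same keys so a reducer's input is concentrated in few chunks) can be illustrated with a rough sketch. This is not the authors' implementation: the map function, the greedy chunk-assignment heuristic, and all names below are assumptions for illustration only.

```python
from collections import defaultdict

def monitor(records, map_fn):
    """Profiling step: run a representative map function over the input
    and record, for each input tuple, the intermediate keys it emits."""
    return [(rec, frozenset(k for k, _ in map_fn(rec))) for rec in records]

def repartition(profiled, num_chunks):
    """Greedy heuristic: place each tuple in the chunk that already holds
    the most tuples emitting the same intermediate keys, so tuples that
    feed the same reducer end up co-located."""
    chunks = [[] for _ in range(num_chunks)]
    key_counts = [defaultdict(int) for _ in range(num_chunks)]
    for rec, keys in profiled:
        # Score each chunk by key overlap; break ties toward the smaller chunk.
        best = max(range(num_chunks),
                   key=lambda c: (sum(key_counts[c][k] for k in keys),
                                  -len(chunks[c])))
        chunks[best].append(rec)
        for k in keys:
            key_counts[best][k] += 1
    return chunks

# Toy run with a word-count-style map function.
def map_fn(line):
    return [(word, 1) for word in line.split()]

data = ["a b", "a b", "c d", "c d"]
parts = repartition(monitor(data, map_fn), 2)
# Tuples emitting the same intermediate keys land in the same chunk:
# parts == [["a b", "a b"], ["c d", "c d"]]
```

In a real deployment, the profiling would run over a sample of representative jobs and the assignment would also have to balance chunk sizes; this sketch ignores both concerns beyond a simple tie-break.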

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 1-12
Number of pages: 12
Volume: 8059 LNCS
ISBN (Print): 9783642400520
DOIs: 10.1007/978-3-642-40053-7_1
Publication status: Published - 10 Oct 2013
Externally published: Yes
Event: 6th International Conference on Data Management in Grid and P2P Systems, Globe 2013 - Prague, Czech Republic
Duration: 28 Aug 2013 - 29 Aug 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8059 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 6th International Conference on Data Management in Grid and P2P Systems, Globe 2013
Country: Czech Republic
City: Prague
Period: 28/8/13 - 29/8/13

Fingerprint

Data Partitioning
MapReduce
Data Transfer
Data Locality
Shuffle
Experimentation
Workload
Assign
Optimise
Monitoring
Benchmark
Decrease
Optimization
Output
Relationships

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Liroz-Gistau, M., Akbarinia, R., Agrawal, D., Pacitti, E., & Valduriez, P. (2013). Data partitioning for minimizing transferred data in MapReduce. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8059 LNCS, pp. 1-12). https://doi.org/10.1007/978-3-642-40053-7_1

@inproceedings{aec074ae30d64444a8ee985018084f6f,
title = "Data partitioning for minimizing transferred data in mapreduce",
abstract = "Reducing data transfer in MapReduce's shuffle phase is very important because it increases data locality of reduce tasks, and thus decreases the overhead of job executions. In the literature, several optimizations have been proposed to reduce data transfer between mappers and reducers. Nevertheless, all these approaches are limited by how intermediate key-value pairs are distributed over map outputs. In this paper, we address the problem of high data transfers in MapReduce, and propose a technique that repartitions tuples of the input datasets, and thereby optimizes the distribution of key-values over mappers, and increases the data locality in reduce tasks. Our approach captures the relationships between input tuples and intermediate keys by monitoring the execution of a set of MapReduce jobs which are representative of the workload. Then, based on those relationships, it assigns input tuples to the appropriate chunks. We evaluated our approach through experimentation in a Hadoop deployment on top of Grid5000 using standard benchmarks. The results show high reduction in data transfer during the shuffle phase compared to Native Hadoop.",
author = "Miguel Liroz-Gistau and Reza Akbarinia and Divyakant Agrawal and Esther Pacitti and Patrick Valduriez",
year = "2013",
month = "10",
day = "10",
doi = "10.1007/978-3-642-40053-7_1",
language = "English",
isbn = "9783642400520",
volume = "8059 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "1--12",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}
