Trojan Data Layouts: Right Shoes for a Running Elephant

Alekh Jindal, Jorge Arnulfo Quiane Ruiz, Jens Dittrich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

69 Citations (Scopus)

Abstract

MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
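The core idea in the abstract can be illustrated with a minimal sketch (not the paper's implementation; all names, the sample schema, and the attribute grouping below are hypothetical): inside one simulated data block, a row layout forces a scan to read every attribute of every record, whereas a Trojan-style layout that groups co-accessed attributes lets a job read only the group it needs.

```python
# Illustrative sketch of row layout vs. attribute-grouped (Trojan-style) layout
# inside a single simulated HDFS data block. The schema, grouping, and byte
# accounting are assumptions for illustration only.

def grouped_layout(records, groups):
    """Serialize a block as attribute groups (mini column groups per block)."""
    return {g: [tuple(rec[a] for a in g) for rec in records] for g in groups}

def bytes_scanned_row(records):
    # A row-layout scan must read every attribute of every record.
    return sum(len(str(v)) for rec in records for v in rec.values())

def bytes_scanned_grouped(block, wanted, groups):
    # Only the attribute groups containing a wanted attribute are read.
    needed = [g for g in groups if any(a in wanted for a in g)]
    return sum(len(str(v)) for g in needed for row in block[g] for v in row)

# A toy log-like dataset and a workload-derived grouping (both assumed).
records = [
    {"id": i, "url": f"http://example.com/page/{i}", "bytes": i * 100, "code": 200}
    for i in range(1000)
]
groups = [("id", "code"), ("url", "bytes")]
block = grouped_layout(records, groups)

wanted = {"id", "code"}  # a job that projects only these two attributes
print(bytes_scanned_row(records), bytes_scanned_grouped(block, wanted, groups))
```

Because the grouped scan skips the wide `url`/`bytes` group entirely, it reads far fewer bytes for this projection; the paper's per-replica twist is to store a differently grouped layout in each HDFS replica and route each job to the replica whose grouping best fits it.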

Original language: English
Title of host publication: Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011
DOIs: https://doi.org/10.1145/2038916.2038937
Publication status: Published - 30 Nov 2011
Externally published: Yes
Event: 2nd ACM Symposium on Cloud Computing, SOCC 2011 - Cascais, Portugal
Duration: 26 Oct 2011 – 28 Oct 2011

Other

Other: 2nd ACM Symposium on Cloud Computing, SOCC 2011
Country: Portugal
City: Cascais
Period: 26/10/11 – 28/10/11

Fingerprint

  • Fault tolerance

Keywords

  • Column grouping
  • MapReduce
  • Per-replica data layout

ASJC Scopus subject areas

  • Software

Cite this

Jindal, A., Quiane Ruiz, J. A., & Dittrich, J. (2011). Trojan data layouts: Right shoes for a running elephant. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011 [a21] https://doi.org/10.1145/2038916.2038937

Trojan data layouts: Right shoes for a running elephant. / Jindal, Alekh; Quiane Ruiz, Jorge Arnulfo; Dittrich, Jens.

Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011. 2011. a21.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Jindal, A, Quiane Ruiz, JA & Dittrich, J 2011, Trojan data layouts: Right shoes for a running elephant. in Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011., a21, 2nd ACM Symposium on Cloud Computing, SOCC 2011, Cascais, Portugal, 26/10/11. https://doi.org/10.1145/2038916.2038937
Jindal A, Quiane Ruiz JA, Dittrich J. Trojan data layouts: Right shoes for a running elephant. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011. 2011. a21 https://doi.org/10.1145/2038916.2038937
Jindal, Alekh ; Quiane Ruiz, Jorge Arnulfo ; Dittrich, Jens. / Trojan data layouts: Right shoes for a running elephant. Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011. 2011.
@inproceedings{27ecda865e594c0c8b9517ffe540284a,
title = "Trojan data layouts: Right shoes for a running elephant",
abstract = "MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.",
keywords = "Column grouping, MapReduce, Per-replica data layout",
author = "Alekh Jindal and {Quiane Ruiz}, {Jorge Arnulfo} and Jens Dittrich",
year = "2011",
month = "11",
day = "30",
doi = "10.1145/2038916.2038937",
language = "English",
isbn = "9781450309769",
booktitle = "Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011",

}

TY - GEN

T1 - Trojan data layouts

T2 - Right shoes for a running elephant

AU - Jindal, Alekh

AU - Quiane Ruiz, Jorge Arnulfo

AU - Dittrich, Jens

PY - 2011/11/30

Y1 - 2011/11/30

N2 - MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.

AB - MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.

KW - Column grouping

KW - MapReduce

KW - Per-replica data layout

UR - http://www.scopus.com/inward/record.url?scp=82155168632&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=82155168632&partnerID=8YFLogxK

U2 - 10.1145/2038916.2038937

DO - 10.1145/2038916.2038937

M3 - Conference contribution

AN - SCOPUS:82155168632

SN - 9781450309769

BT - Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011

ER -