Trojan data Layouts: Right shoes for a running elephant

Alekh Jindal, Jorge Arnulfo Quiane Ruiz, Jens Dittrich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

70 Citations (Scopus)


MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of the Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
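The per-replica idea in the abstract can be illustrated with a toy sketch. It is not the authors' algorithm or cost model: it assumes, purely for illustration, that a job must read every attribute group containing at least one attribute it references, and that the cost of a group is the sum of its attribute widths in bytes. The names `bytes_read`, `pick_replica`, the attribute names, and the widths are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# A layout partitions a block's attributes into attribute groups.
Layout = List[FrozenSet[str]]


def bytes_read(layout: Layout, query_attrs: Set[str], widths: Dict[str, int]) -> int:
    """Bytes per tuple read under the assumed cost model:
    every group that shares an attribute with the query is read in full."""
    return sum(
        sum(widths[attr] for attr in group)
        for group in layout
        if group & query_attrs
    )


def pick_replica(replicas: Dict[str, Layout], query_attrs: Set[str],
                 widths: Dict[str, int]) -> str:
    """Return the name of the replica whose layout reads the fewest bytes."""
    return min(replicas, key=lambda name: bytes_read(replicas[name], query_attrs, widths))


# Hypothetical attribute widths in bytes.
widths = {"id": 4, "name": 32, "price": 8, "comment": 100}

# Three replicas of the same block, each with a different grouping (assumed).
replicas = {
    "replica_1": [frozenset({"id", "name", "price", "comment"})],  # row-like: one group
    "replica_2": [frozenset({"id", "price"}), frozenset({"name", "comment"})],
    "replica_3": [frozenset({"id", "name"}), frozenset({"price", "comment"})],
}

# A job touching only id and price is best served by replica_2.
best = pick_replica(replicas, {"id", "price"}, widths)
```

Under these assumptions, a job that references `id` and `price` reads 12 bytes per tuple from `replica_2` but 144 bytes from either of the other two, while a job that references `id` and `name` would be routed to `replica_3` instead. A real system would also have to account for tuple reconstruction cost and data locality, which this sketch ignores.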

Original language: English
Title of host publication: Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011
Publication status: Published - 30 Nov 2011
Externally published: Yes
Event: 2nd ACM Symposium on Cloud Computing, SOCC 2011 - Cascais, Portugal
Duration: 26 Oct 2011 to 28 Oct 2011


Other: 2nd ACM Symposium on Cloud Computing, SOCC 2011



Keywords

  • Column grouping
  • MapReduce
  • Per-replica data layout

ASJC Scopus subject areas

  • Software

Cite this

Jindal, A., Quiane Ruiz, J. A., & Dittrich, J. (2011). Trojan data Layouts: Right shoes for a running elephant. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011 [a21]