Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework

Yuting Lin, Divyakant Agrawal, Chun Chen, Beng Chin Ooi, Sai Wu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

93 Citations (Scopus)

Abstract

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, LLama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.

Original languageEnglish
Title of host publicationProceedings of SIGMOD 2011 and PODS 2011
Pages961-972
Number of pages12
DOIs
Publication statusPublished - 11 Jul 2011
Event2011 ACM SIGMOD and 30th PODS 2011 Conference - Athens, Greece
Duration: 12 Jun 201116 Jun 2011

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Other

Other2011 ACM SIGMOD and 30th PODS 2011 Conference
CountryGreece
CityAthens
Period12/6/1116/6/11

    Fingerprint

Keywords

  • MapReduce
  • column store
  • join

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Lin, Y., Agrawal, D., Chen, C., Ooi, B. C., & Wu, S. (2011). Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. In Proceedings of SIGMOD 2011 and PODS 2011 (pp. 961-972). (Proceedings of the ACM SIGMOD International Conference on Management of Data). https://doi.org/10.1145/1989323.1989424