Llama

Leveraging columnar storage for scalable join processing in the MapReduce framework

Yuting Lin, Divyakant Agrawal, Chun Chen, Beng Chin Ooi, Sai Wu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

92 Citations (Scopus)

Abstract

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, LLama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.

Original languageEnglish
Title of host publicationProceedings of the ACM SIGMOD International Conference on Management of Data
Pages961-972
Number of pages12
DOIs
Publication statusPublished - 11 Jul 2011
Externally publishedYes
Event2011 ACM SIGMOD and 30th PODS 2011 Conference - Athens, Greece
Duration: 12 Jun 201116 Jun 2011

Other

Other2011 ACM SIGMOD and 30th PODS 2011 Conference
CountryGreece
CityAthens
Period12/6/1116/6/11

Fingerprint

Data warehouses
Processing
Information management
Scalability
Engines
Experiments

Keywords

  • column store
  • join
  • MapReduce

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Lin, Y., Agrawal, D., Chen, C., Ooi, B. C., & Wu, S. (2011). Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 961-972) https://doi.org/10.1145/1989323.1989424

Llama : Leveraging columnar storage for scalable join processing in the MapReduce framework. / Lin, Yuting; Agrawal, Divyakant; Chen, Chun; Ooi, Beng Chin; Wu, Sai.

Proceedings of the ACM SIGMOD International Conference on Management of Data. 2011. p. 961-972.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Lin, Y, Agrawal, D, Chen, C, Ooi, BC & Wu, S 2011, Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. in Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 961-972, 2011 ACM SIGMOD and 30th PODS 2011 Conference, Athens, Greece, 12/6/11. https://doi.org/10.1145/1989323.1989424
Lin Y, Agrawal D, Chen C, Ooi BC, Wu S. Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2011. p. 961-972 https://doi.org/10.1145/1989323.1989424
Lin, Yuting ; Agrawal, Divyakant ; Chen, Chun ; Ooi, Beng Chin ; Wu, Sai. / Llama : Leveraging columnar storage for scalable join processing in the MapReduce framework. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2011. pp. 961-972
@inproceedings{14472ee1761f4e0fb863565534216d5d,
title = "Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework",
abstract = "To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, LLama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.",
keywords = "column store, join, MapReduce",
author = "Yuting Lin and Divyakant Agrawal and Chun Chen and Ooi, {Beng Chin} and Sai Wu",
year = "2011",
month = "7",
day = "11",
doi = "10.1145/1989323.1989424",
language = "English",
isbn = "9781450306614",
pages = "961--972",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - Llama

T2 - Leveraging columnar storage for scalable join processing in the MapReduce framework

AU - Lin, Yuting

AU - Agrawal, Divyakant

AU - Chen, Chun

AU - Ooi, Beng Chin

AU - Wu, Sai

PY - 2011/7/11

Y1 - 2011/7/11

N2 - To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, LLama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.

AB - To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, LLama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.

KW - column store

KW - join

KW - MapReduce

UR - http://www.scopus.com/inward/record.url?scp=79959945877&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79959945877&partnerID=8YFLogxK

U2 - 10.1145/1989323.1989424

DO - 10.1145/1989323.1989424

M3 - Conference contribution

SN - 9781450306614

SP - 961

EP - 972

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

ER -