Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)

Jens Dittrich, Jorge Arnulfo Quiane Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg Schad

Research output: Chapter in Book/Report/Conference proceeding > Chapter

292 Citations (Scopus)

Abstract

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly outperforms Hadoop. Second, any future changes of Hadoop may directly be used with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 518-529
Number of pages: 12
Volume: 3
Edition: 1
Publication status: Published - Sep 2010
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Dittrich, J., Quiane Ruiz, J. A., Jindal, A., Kargin, Y., Setty, V., & Schad, J. (2010). Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In Proceedings of the VLDB Endowment (1 ed., Vol. 3, pp. 518-529).

Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). / Dittrich, Jens; Quiane Ruiz, Jorge Arnulfo; Jindal, Alekh; Kargin, Yagiz; Setty, Vinay; Schad, Jörg.

Proceedings of the VLDB Endowment. Vol. 3, 1st ed. 2010. pp. 518-529.

Dittrich, J, Quiane Ruiz, JA, Jindal, A, Kargin, Y, Setty, V & Schad, J 2010, Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). in Proceedings of the VLDB Endowment. 1 edn, vol. 3, pp. 518-529.
Dittrich J, Quiane Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In Proceedings of the VLDB Endowment. 1 ed. Vol. 3. 2010. p. 518-529.
Dittrich, Jens; Quiane Ruiz, Jorge Arnulfo; Jindal, Alekh; Kargin, Yagiz; Setty, Vinay; Schad, Jörg. / Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment. Vol. 3, 1st ed. 2010. pp. 518-529.
@inbook{af481abea94e4eb69a8a4fbffd5da7b4,
title = "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)",
abstract = "MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly outperforms Hadoop. Second, any future changes of Hadoop may directly be used with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.",
author = "Jens Dittrich and {Quiane Ruiz}, {Jorge Arnulfo} and Alekh Jindal and Yagiz Kargin and Vinay Setty and J{\"o}rg Schad",
year = "2010",
month = "9",
language = "English",
volume = "3",
pages = "518--529",
booktitle = "Proceedings of the VLDB Endowment",
edition = "1",

}

TY - CHAP

T1 - Hadoop++

T2 - Making a yellow elephant run like a cheetah (without it even noticing)

AU - Dittrich, Jens

AU - Quiane Ruiz, Jorge Arnulfo

AU - Jindal, Alekh

AU - Kargin, Yagiz

AU - Setty, Vinay

AU - Schad, Jörg

PY - 2010/9

Y1 - 2010/9

AB - MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly outperforms Hadoop. Second, any future changes of Hadoop may directly be used with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.

UR - http://www.scopus.com/inward/record.url?scp=80053521271&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053521271&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:80053521271

VL - 3

SP - 518

EP - 529

BT - Proceedings of the VLDB Endowment

ER -