RHEEM: Enabling cross platform data processing

Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, Anis Troudi

Research output: Contribution to journal (conference article)

6 Citations (Scopus)

Abstract

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) an interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.
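The idea of splitting a task into subtasks and assigning each one to a platform so that the overall cost is minimized can be illustrated with a small sketch. This is not Rheem's optimizer: it assumes a purely linear pipeline, a hypothetical table of per-platform operator costs, and a flat hypothetical `move_cost` charged whenever consecutive operators run on different platforms. The function name `assign_platforms` and the platform labels are made up for illustration.

```python
from typing import Dict, List, Tuple


def assign_platforms(
    op_costs: List[Dict[str, float]],
    move_cost: float,
) -> Tuple[float, List[str]]:
    """Pick one platform per operator in a linear pipeline.

    op_costs[i][p] is the (hypothetical) cost of running operator i on
    platform p; move_cost is a flat (hypothetical) charge for every switch
    between platforms. Returns the minimum total cost and the chosen plan.
    """
    # best[p] = (cheapest cost of any plan that ends on platform p, that plan)
    best = {p: (c, [p]) for p, c in op_costs[0].items()}
    for costs in op_costs[1:]:
        nxt = {}
        for p, c in costs.items():
            # Staying on the same platform is free; switching pays move_cost.
            prev_cost, prev_plan = min(
                (
                    (pc + (0.0 if q == p else move_cost), plan)
                    for q, (pc, plan) in best.items()
                ),
                key=lambda t: t[0],
            )
            nxt[p] = (prev_cost + c, prev_plan + [p])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


# Made-up costs for a three-operator task on two platforms.
pipeline = [
    {"dbms": 2.0, "spark": 5.0},  # selection: cheap inside the database
    {"dbms": 9.0, "spark": 3.0},  # iterative step: cheaper on a dataflow engine
    {"dbms": 8.0, "spark": 2.0},  # final aggregation
]
total, plan = assign_platforms(pipeline, move_cost=1.0)
print(total, plan)
```

With these made-up numbers the mixed plan costs less than running every operator on a single platform (19.0 on the DBMS alone, 10.0 on Spark alone), which is the effect the abstract describes. A real optimizer would also have to handle non-linear plans and data-dependent movement costs.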

Original language: English
Pages (from-to): 1414-1427
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 11
Issue number: 11
DOIs: 10.14778/3236187.3236195
Publication status: Published - 1 Jan 2017
Event: 44th International Conference on Very Large Data Bases, VLDB 2018 - Rio de Janeiro, Brazil
Duration: 27 Aug 2017 to 31 Aug 2017

Fingerprint

Costs
Industry
Mechanics

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

RHEEM: Enabling cross platform data processing. / Agrawal, Divy; Chawla, Sanjay; Contreras-Rojas, Bertty; Elmagarmid, Ahmed; Idris, Yasser; Kaoudi, Zoi; Kruse, Sebastian; Lucas, Ji; Mansour, Essam; Ouzzani, Mourad; Papotti, Paolo; Quiané-Ruiz, Jorge Arnulfo; Tang, Nan; Thirumuruganathan, Saravanan; Troudi, Anis.

In: Proceedings of the VLDB Endowment, Vol. 11, No. 11, 01.01.2017, p. 1414-1427.

Agrawal, Divy; Chawla, Sanjay; Contreras-Rojas, Bertty; Elmagarmid, Ahmed; Idris, Yasser; Kaoudi, Zoi; Kruse, Sebastian; Lucas, Ji; Mansour, Essam; Ouzzani, Mourad; Papotti, Paolo; Quiané-Ruiz, Jorge Arnulfo; Tang, Nan; Thirumuruganathan, Saravanan; Troudi, Anis. / RHEEM: Enabling cross platform data processing. In: Proceedings of the VLDB Endowment. 2017; Vol. 11, No. 11. pp. 1414-1427.
@article{46affabb6ce842029223d3002a177d59,
title = "RHEEM: Enabling cross platform data processing",
author = "Divy Agrawal and Sanjay Chawla and Bertty Contreras-Rojas and Ahmed Elmagarmid and Yasser Idris and Zoi Kaoudi and Sebastian Kruse and Ji Lucas and Essam Mansour and Mourad Ouzzani and Paolo Papotti and Quiané-Ruiz, {Jorge Arnulfo} and Nan Tang and Saravanan Thirumuruganathan and Anis Troudi",
year = "2017",
month = "1",
day = "1",
doi = "10.14778/3236187.3236195",
language = "English",
volume = "11",
pages = "1414--1427",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "11",

}

TY - JOUR

T1 - RHEEM: Enabling cross platform data processing

AU - Agrawal, Divy

AU - Chawla, Sanjay

AU - Contreras-Rojas, Bertty

AU - Elmagarmid, Ahmed

AU - Idris, Yasser

AU - Kaoudi, Zoi

AU - Kruse, Sebastian

AU - Lucas, Ji

AU - Mansour, Essam

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Quiané-Ruiz, Jorge Arnulfo

AU - Tang, Nan

AU - Thirumuruganathan, Saravanan

AU - Troudi, Anis

PY - 2017/1/1

Y1 - 2017/1/1

N2 - Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) an interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.

UR - http://www.scopus.com/inward/record.url?scp=85058894597&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058894597&partnerID=8YFLogxK

U2 - 10.14778/3236187.3236195

DO - 10.14778/3236187.3236195

M3 - Conference article

AN - SCOPUS:85058894597

VL - 11

SP - 1414

EP - 1427

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 11

ER -