Data civilizer 2.0: A holistic framework for data preparation and analytics

El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, Ahmed K. Elmagarmid

Research output: Contribution to journal > Conference article

1 Citation (Scopus)

Abstract

Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. In addition, this system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset.

Original language: English
Pages (from-to): 1954-1957
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 12
Issue number: 12
DOIs: https://doi.org/10.14778/3352063.3352108
Publication status: Published - 1 Jan 2018
Event: 45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States
Duration: 26 Aug 2019 to 30 Aug 2019

Fingerprint

Learning systems
Cleaning
Brain
Visualization
Tuning
Pipelines

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Data civilizer 2.0: A holistic framework for data preparation and analytics. / Rezig, El Kindi; Cao, Lei; Stonebraker, Michael; Simonini, Giovanni; Tao, Wenbo; Madden, Samuel; Ouzzani, Mourad; Tang, Nan; Elmagarmid, Ahmed K.

In: Proceedings of the VLDB Endowment, Vol. 12, No. 12, 01.01.2018, p. 1954-1957.

Rezig EK, Cao L, Stonebraker M, Simonini G, Tao W, Madden S et al. Data civilizer 2.0: A holistic framework for data preparation and analytics. Proceedings of the VLDB Endowment. 2018 Jan 1;12(12):1954-1957. https://doi.org/10.14778/3352063.3352108
Rezig, El Kindi; Cao, Lei; Stonebraker, Michael; Simonini, Giovanni; Tao, Wenbo; Madden, Samuel; Ouzzani, Mourad; Tang, Nan; Elmagarmid, Ahmed K. / Data civilizer 2.0: A holistic framework for data preparation and analytics. In: Proceedings of the VLDB Endowment. 2018; Vol. 12, No. 12. pp. 1954-1957.
@article{8fe885c7bdfc4b4fb6a0e63b4865e7ce,
title = "Data civilizer 2.0: A holistic framework for data preparation and analytics",
abstract = "Data scientists spend over 80{\%} of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. In addition, this system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset.",
author = "Rezig, {El Kindi} and Lei Cao and Michael Stonebraker and Giovanni Simonini and Wenbo Tao and Samuel Madden and Mourad Ouzzani and Nan Tang and Elmagarmid, {Ahmed K.}",
year = "2018",
month = "1",
day = "1",
doi = "10.14778/3352063.3352108",
language = "English",
volume = "12",
pages = "1954--1957",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "12",

}

TY - JOUR

T1 - Data civilizer 2.0

T2 - A holistic framework for data preparation and analytics

AU - Rezig, El Kindi

AU - Cao, Lei

AU - Stonebraker, Michael

AU - Simonini, Giovanni

AU - Tao, Wenbo

AU - Madden, Samuel

AU - Ouzzani, Mourad

AU - Tang, Nan

AU - Elmagarmid, Ahmed K.

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. In addition, this system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset.

UR - http://www.scopus.com/inward/record.url?scp=85074539229&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85074539229&partnerID=8YFLogxK

U2 - 10.14778/3352063.3352108

DO - 10.14778/3352063.3352108

M3 - Conference article

AN - SCOPUS:85074539229

VL - 12

SP - 1954

EP - 1957

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 12

ER -