The data civilizer system

Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Nan Tang

Research output: Contribution to conference › Paper

32 Citations (Scopus)

Abstract

In many organizations, it is often challenging for users to find relevant data for specific tasks, since the data is usually scattered across the enterprise and often inconsistent. In fact, data scientists routinely report that the majority of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. In order to decrease the “grunt work” needed to facilitate the analysis of data “in the wild”, we present DATA CIVILIZER, an end-to-end big data management system. DATA CIVILIZER has a linkage graph computation module to build a linkage graph for the data and a data discovery module which utilizes the linkage graph to help identify data that is relevant to user tasks. It also uses the linkage graph to discover possible join paths that can then be used in a query. For the actual query execution, we use a polystore DBMS, which federates query processing across disparate systems. In addition, DATA CIVILIZER integrates data cleaning operations into query processing. Because different users need to invoke the above tasks in different orders, DATA CIVILIZER embeds a workflow engine which enables the arbitrary composition of different modules, as well as the handling of data updates. We have deployed our preliminary DATA CIVILIZER system in two institutions, MIT and Merck, and describe initial positive experiences that show the system shortens the time and effort required to find, prepare, and analyze data.
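As a rough illustration of the join path discovery mentioned in the abstract, the sketch below treats a linkage graph as an adjacency map between tables and searches it breadth-first. This is not the DATA CIVILIZER implementation: the graph layout, the table and column names, and the function `find_join_path` are all hypothetical, chosen only to show the general idea of following links between tables to reach a join path.

```python
from collections import deque


def find_join_path(linkage_graph, source, target):
    """Return a shortest chain of tables linking source to target, or None.

    linkage_graph maps a table name to a list of (neighbor_table, join_column)
    pairs. This layout is an assumption made for illustration only.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        table = path[-1]
        if table == target:
            return path
        for neighbor, _join_column in linkage_graph.get(table, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of links connects the two tables


# Hypothetical linkage graph: an edge means two tables share a joinable column.
linkage_graph = {
    "employees": [("departments", "dept_id")],
    "departments": [("employees", "dept_id"), ("buildings", "building_id")],
    "buildings": [("departments", "building_id")],
}

path = find_join_path(linkage_graph, "employees", "buildings")
```

Here `path` lists the tables to join in order. A real system would presumably also weigh link quality and return several candidate paths, which this sketch does not attempt.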

Original language: English
Publication status: Published - 1 Jan 2017
Event: 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017 - Santa Cruz, United States
Duration: 8 Jan 2017 to 11 Jan 2017

Conference

Conference: 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017
Country: United States
City: Santa Cruz
Period: 8/1/17 to 11/1/17

Fingerprint

Query processing
Cleaning
Information management
Engines
Chemical analysis
Industry
Linkage
Graph
Module
Big data
Query

ASJC Scopus subject areas

  • Information Systems
  • Artificial Intelligence
  • Information Systems and Management
  • Hardware and Architecture

Cite this

Deng, D., Castro Fernandez, R., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A., ... Tang, N. (2017). The data civilizer system. Paper presented at 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Santa Cruz, United States.

@conference{6aea0a044cfe45c5a18a05d1e423cf64,
title = "The data civilizer system",
abstract = "In many organizations, it is often challenging for users to find relevant data for specific tasks, since the data is usually scattered across the enterprise and often inconsistent. In fact, data scientists routinely report that the majority of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. In order to decrease the “grunt work” needed to facilitate the analysis of data “in the wild”, we present DATA CIVILIZER, an end-to-end big data management system. DATA CIVILIZER has a linkage graph computation module to build a linkage graph for the data and a data discovery module which utilizes the linkage graph to help identify data that is relevant to user tasks. It also uses the linkage graph to discover possible join paths that can then be used in a query. For the actual query execution, we use a polystore DBMS, which federates query processing across disparate systems. In addition, DATA CIVILIZER integrates data cleaning operations into query processing. Because different users need to invoke the above tasks in different orders, DATA CIVILIZER embeds a workflow engine which enables the arbitrary composition of different modules, as well as the handling of data updates. We have deployed our preliminary DATA CIVILIZER system in two institutions, MIT and Merck and describe initial positive experiences that show the system shortens the time and effort required to find, prepare, and analyze data.",
author = "Dong Deng and {Castro Fernandez}, Raul and Ziawasch Abedjan and Sibo Wang and Michael Stonebraker and Ahmed Elmagarmid and Ilyas, {Ihab F.} and Samuel Madden and Mourad Ouzzani and Nan Tang",
year = "2017",
month = "1",
day = "1",
language = "English",
note = "8th Biennial Conference on Innovative Data Systems Research, CIDR 2017 ; Conference date: 08-01-2017 Through 11-01-2017",

}

TY - CONF

T1 - The data civilizer system

AU - Deng, Dong

AU - Castro Fernandez, Raul

AU - Abedjan, Ziawasch

AU - Wang, Sibo

AU - Stonebraker, Michael

AU - Elmagarmid, Ahmed

AU - Ilyas, Ihab F.

AU - Madden, Samuel

AU - Ouzzani, Mourad

AU - Tang, Nan

PY - 2017/1/1

Y1 - 2017/1/1

N2 - In many organizations, it is often challenging for users to find relevant data for specific tasks, since the data is usually scattered across the enterprise and often inconsistent. In fact, data scientists routinely report that the majority of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. In order to decrease the “grunt work” needed to facilitate the analysis of data “in the wild”, we present DATA CIVILIZER, an end-to-end big data management system. DATA CIVILIZER has a linkage graph computation module to build a linkage graph for the data and a data discovery module which utilizes the linkage graph to help identify data that is relevant to user tasks. It also uses the linkage graph to discover possible join paths that can then be used in a query. For the actual query execution, we use a polystore DBMS, which federates query processing across disparate systems. In addition, DATA CIVILIZER integrates data cleaning operations into query processing. Because different users need to invoke the above tasks in different orders, DATA CIVILIZER embeds a workflow engine which enables the arbitrary composition of different modules, as well as the handling of data updates. We have deployed our preliminary DATA CIVILIZER system in two institutions, MIT and Merck and describe initial positive experiences that show the system shortens the time and effort required to find, prepare, and analyze data.

UR - http://www.scopus.com/inward/record.url?scp=85072864586&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85072864586&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85072864586

ER -