BigDansing

A system for big data cleansing

Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge Arnulfo Quiane Ruiz, Nan Tang, Si Yin

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

43 Citations (Scopus)

Abstract

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1215-1230
Number of pages: 16
Volume: 2015-May
ISBN (Print): 9781450327589
DOIs: https://doi.org/10.1145/2723372.2747646
Publication status: Published - 27 May 2015
Event: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: 31 May 2015 to 4 Jun 2015

Other

Other: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country: Australia
City: Melbourne
Period: 31/5/15 to 4/6/15

Fingerprint

User interfaces
Scalability
Repair
Big data

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., ... Yin, S. (2015). BigDansing: A system for big data cleansing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Vol. 2015-May, pp. 1215-1230). Association for Computing Machinery. https://doi.org/10.1145/2723372.2747646

BigDansing: A system for big data cleansing. / Khayyat, Zuhair; Ilyas, Ihab F.; Jindal, Alekh; Madden, Samuel; Ouzzani, Mourad; Papotti, Paolo; Quiane Ruiz, Jorge Arnulfo; Tang, Nan; Yin, Si.

Proceedings of the ACM SIGMOD International Conference on Management of Data. Vol. 2015-May Association for Computing Machinery, 2015. p. 1215-1230.

Khayyat, Z, Ilyas, IF, Jindal, A, Madden, S, Ouzzani, M, Papotti, P, Quiane Ruiz, JA, Tang, N & Yin, S 2015, BigDansing: A system for big data cleansing. in Proceedings of the ACM SIGMOD International Conference on Management of Data. vol. 2015-May, Association for Computing Machinery, pp. 1215-1230, ACM SIGMOD International Conference on Management of Data, SIGMOD 2015, Melbourne, Australia, 31/5/15. https://doi.org/10.1145/2723372.2747646
Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P et al. BigDansing: A system for big data cleansing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Vol. 2015-May. Association for Computing Machinery. 2015. p. 1215-1230 https://doi.org/10.1145/2723372.2747646
Khayyat, Zuhair ; Ilyas, Ihab F. ; Jindal, Alekh ; Madden, Samuel ; Ouzzani, Mourad ; Papotti, Paolo ; Quiane Ruiz, Jorge Arnulfo ; Tang, Nan ; Yin, Si. / BigDansing: A system for big data cleansing. Proceedings of the ACM SIGMOD International Conference on Management of Data. Vol. 2015-May Association for Computing Machinery, 2015. pp. 1215-1230
@inproceedings{5f78358016e44c22bf2bee6afccb1ddd,
title = "BigDansing: A system for big data cleansing",
abstract = "Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.",
author = "Zuhair Khayyat and Ilyas, {Ihab F.} and Alekh Jindal and Samuel Madden and Mourad Ouzzani and Paolo Papotti and {Quiane Ruiz}, {Jorge Arnulfo} and Nan Tang and Si Yin",
year = "2015",
month = "5",
day = "27",
doi = "10.1145/2723372.2747646",
language = "English",
isbn = "9781450327589",
volume = "2015-May",
pages = "1215--1230",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - BigDansing

T2 - A system for big data cleansing

AU - Khayyat, Zuhair

AU - Ilyas, Ihab F.

AU - Jindal, Alekh

AU - Madden, Samuel

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Quiane Ruiz, Jorge Arnulfo

AU - Tang, Nan

AU - Yin, Si

PY - 2015/5/27

Y1 - 2015/5/27

N2 - Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.

UR - http://www.scopus.com/inward/record.url?scp=84949872769&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84949872769&partnerID=8YFLogxK

U2 - 10.1145/2723372.2747646

DO - 10.1145/2723372.2747646

M3 - Conference contribution

SN - 9781450327589

VL - 2015-May

SP - 1215

EP - 1230

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

PB - Association for Computing Machinery

ER -