Holistic data cleaning: Putting violations into context

Xu Chu, Ihab F. Ilyas, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

109 Citations (Scopus)

Abstract

Data cleaning is an important problem, and data quality rules are the most promising way to address it declaratively. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), which have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work, we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as "greater than" and "less than". More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such a holistic view of the conflicts is the starting point for a novel definition of repair context, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.
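To make the two central ideas of the abstract concrete, the following is a minimal sketch of how denial constraints with comparison predicates might be checked over pairs of tuples, and how the resulting violations could be collected as hyperedges over cells. The table `ROWS`, the rule names `DC_ZIP_CITY` and `DC_SALARY_TAX`, and the helper functions are hypothetical and chosen only for illustration; this is not the authors' implementation, and the cell-frequency count at the end is a simple illustration of overlapping conflicts rather than the paper's repair algorithm.

```python
import operator
from collections import Counter
from itertools import permutations

# Hypothetical toy table; attribute names and values are illustrative only.
ROWS = [
    {"id": 1, "zip": "10001", "city": "NYC", "salary": 90, "tax": 20},
    {"id": 2, "zip": "10001", "city": "NY", "salary": 80, "tax": 25},
    {"id": 3, "zip": "10001", "city": "NYC", "salary": 70, "tax": 10},
]

# A denial constraint is modelled here as a list of predicates
# (attribute of t1, comparison, attribute of t2). A pair (t1, t2)
# violates the constraint when ALL of its predicates hold.
# The FD zip -> city, written as a denial constraint:
DC_ZIP_CITY = [("zip", operator.eq, "zip"), ("city", operator.ne, "city")]
# A numerical rule: a higher salary must not come with a lower tax.
DC_SALARY_TAX = [("salary", operator.gt, "salary"), ("tax", operator.lt, "tax")]


def violations(rows, dc):
    """Yield one hyperedge (a frozenset of (row id, attribute) cells) per violation."""
    for t1, t2 in permutations(rows, 2):
        if all(op(t1[a], t2[b]) for a, op, b in dc):
            yield frozenset(
                [(t1["id"], a) for a, _, _ in dc]
                + [(t2["id"], b) for _, _, b in dc]
            )


def conflict_hypergraph(rows, dcs):
    """Collect the violations of all constraints into a single set of hyperedges."""
    edges = set()
    for dc in dcs:
        edges.update(violations(rows, dc))
    return edges


edges = conflict_hypergraph(ROWS, [DC_ZIP_CITY, DC_SALARY_TAX])

# Cells that take part in several violations are natural repair candidates.
cell_counts = Counter(cell for edge in edges for cell in edge)
for cell, count in sorted(cell_counts.items()):
    print(cell, count)
```

In this toy data, the `zip` and `city` cells of row 2 take part in two violations each, while every other involved cell takes part in one. Looking at all constraints together exposes this kind of overlap, which a per-rule pipeline would not see directly.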

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 458-469
Number of pages: 12
DOIs: https://doi.org/10.1109/ICDE.2013.6544847
Publication status: Published - 15 Aug 2013
Event: 29th International Conference on Data Engineering, ICDE 2013 - Brisbane, QLD, Australia
Duration: 8 Apr 2013 to 11 Apr 2013

Other

Other: 29th International Conference on Data Engineering, ICDE 2013
Country: Australia
City: Brisbane, QLD
Period: 8/4/13 to 11/4/13

Fingerprint

Cleaning
Repair
Pipelines

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Chu, X., Ilyas, I. F., & Papotti, P. (2013). Holistic data cleaning: Putting violations into context. In Proceedings - International Conference on Data Engineering (pp. 458-469). [6544847] https://doi.org/10.1109/ICDE.2013.6544847

@inproceedings{f65a85b222274922b597f7851506a927,
title = "Holistic data cleaning: Putting violations into context",
author = "Xu Chu and Ilyas, {Ihab F.} and Paolo Papotti",
year = "2013",
month = "8",
day = "15",
doi = "10.1109/ICDE.2013.6544847",
language = "English",
isbn = "9781467349086",
pages = "458--469",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - Holistic data cleaning

T2 - Putting violations into context

AU - Chu, Xu

AU - Ilyas, Ihab F.

AU - Papotti, Paolo

PY - 2013/8/15

Y1 - 2013/8/15

N2 - Data cleaning is an important problem and data quality rules are the most promising way to face it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as "greater than" and "less than". More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such holistic view of the conflicts is the starting point for a novel definition of repair context which allows us to compute automatically repairs of better quality w.r.t. previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.

UR - http://www.scopus.com/inward/record.url?scp=84881365460&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881365460&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2013.6544847

DO - 10.1109/ICDE.2013.6544847

M3 - Conference contribution

AN - SCOPUS:84881365460

SN - 9781467349086

SP - 458

EP - 469

BT - Proceedings - International Conference on Data Engineering

ER -