Descriptive and prescriptive data cleaning

Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

27 Citations (Scopus)

Abstract

Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
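The abstract outlines the core mechanism: quality rules are checked on a target report produced by a transformation over several sources, and the detected violations are propagated back through the transformation's lineage to the source tuples that caused them (descriptive explanations at the target, prescriptive actions at the source). The Python sketch below is a minimal illustration of that idea under simplifying assumptions, not the paper's system: the functional dependency city -> zip, the toy report, and the per-row lineage lists are all invented for the example.

from collections import defaultdict

# Target report rows, each carrying the lineage of the source rows that produced it.
# The schema and the lineage representation are assumptions made for illustration.
report = [
    {"city": "Snowbird", "zip": "84092", "lineage": [("orders", 1), ("customers", 7)]},
    {"city": "Snowbird", "zip": "84121", "lineage": [("orders", 2), ("customers", 7)]},
    {"city": "Provo",    "zip": "84601", "lineage": [("orders", 3), ("customers", 9)]},
]

def fd_violations(rows, lhs, rhs):
    """Return the rows that violate the functional dependency lhs -> rhs."""
    groups = defaultdict(set)
    for r in rows:
        groups[r[lhs]].add(r[rhs])
    return [r for r in rows if len(groups[r[lhs]]) > 1]

def trace_to_sources(violating_rows):
    """Propagate target-level violations to the source tuples in their lineage."""
    blamed = defaultdict(set)
    for r in violating_rows:
        for table, row_id in r["lineage"]:
            blamed[table].add(row_id)
    return dict(blamed)

bad = fd_violations(report, lhs="city", rhs="zip")  # describe: violating tuples on the target
print(trace_to_sources(bad))                        # prescribe: candidate source rows to inspect
# -> {'orders': {1, 2}, 'customers': {7}}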

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 445-456
Number of pages: 12
ISBN (Print): 9781450323765
DOIs: https://doi.org/10.1145/2588555.2610520
Publication status: Published - 1 Jan 2014
Event: 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014 - Snowbird, UT, United States
Duration: 22 Jun 2014 - 27 Jun 2014

Other

Other: 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014
Country: United States
City: Snowbird, UT
Period: 22/6/14 - 27/6/14

Fingerprint

  • Cleaning
  • Repair
  • Industry

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Chalamalla, A., Ilyas, I. F., Ouzzani, M., & Papotti, P. (2014). Descriptive and prescriptive data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 445-456). Association for Computing Machinery. https://doi.org/10.1145/2588555.2610520
