Proof positive and negative in data cleaning

Matteo Interlandi, Nan Tang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Citations (Scopus)

Abstract

One notoriously hard data cleaning problem is, given a database, how to precisely capture which values are correct (i.e., proof positive) and which are wrong (i.e., proof negative). Although integrity constraints have been widely studied to capture data errors as violations, the accuracy of data cleaning using integrity constraints has long been controversial. Overall, they share one fundamental problem: given a set of data values that together form a violation, there is no evidence of which value is proof positive or negative. Hence, it is known that integrity constraints by themselves cannot guide dependable data cleaning. In this work, we introduce an automated method for proof positive and negative in data cleaning, based on Sherlock rules and reference tables. Given a tuple and reference tables, Sherlock rules tell us which attributes are proof positive, which attributes are proof negative, and (possibly) how to update them. We study several fundamental problems associated with Sherlock rules. We also present efficient algorithms for cleaning data using Sherlock rules. We experimentally demonstrate that our techniques can not only annotate data with proof positive and negative, but also repair data when enough information is available.
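The annotate-and-repair idea in the abstract can be illustrated with a small sketch. This is not the paper's formal definition of a Sherlock rule: the `Rule` shape, the `apply_rule` function, the column names, and the sample data are all hypothetical, chosen only to show how a tuple might be matched against a reference table, annotated with "+" (proof positive) or "-" (proof negative), and repaired when the reference table supplies enough information.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical rule shape (an assumption, not the paper's formalism)."""
    evidence: Dict[str, str]          # tuple attribute -> reference column that must match
    target: str                       # tuple attribute to be judged
    correct_col: str                  # reference column holding the correct value
    wrong_col: Optional[str] = None   # reference column holding a known-wrong value


def apply_rule(row: Dict[str, str],
               reference: List[Dict[str, str]],
               rule: Rule) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (annotations, repaired_row) for a single tuple."""
    annotations: Dict[str, str] = {}
    repaired = dict(row)
    for ref in reference:
        # The evidence attributes must agree with this reference row.
        if not all(row.get(a) == ref.get(c) for a, c in rule.evidence.items()):
            continue
        value = row.get(rule.target)
        if value == ref[rule.correct_col]:
            # Evidence and target both agree with the reference.
            for a in rule.evidence:
                annotations[a] = "+"
            annotations[rule.target] = "+"
        elif rule.wrong_col is not None and value == ref[rule.wrong_col]:
            # The target holds a value the reference identifies as wrong.
            for a in rule.evidence:
                annotations[a] = "+"
            annotations[rule.target] = "-"
            repaired[rule.target] = ref[rule.correct_col]
        # Otherwise there is not enough information: leave the tuple unannotated.
        break
    return annotations, repaired


# Made-up reference table and tuple, for illustration only.
reference = [{"country": "China", "capital": "Beijing", "largest_city": "Shanghai"}]
rule = Rule(evidence={"country": "country"}, target="capital",
            correct_col="capital", wrong_col="largest_city")
row = {"name": "Li", "country": "China", "capital": "Shanghai"}

annotations, repaired = apply_rule(row, reference, rule)
print(annotations)          # {'country': '+', 'capital': '-'}
print(repaired["capital"])  # Beijing
```

In this sketch a value is marked negative only when the reference table explicitly identifies it as a wrong value for the matched evidence; any other mismatch is left unannotated rather than guessed, which is one simple way to read "repair data when enough information is available".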

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Publisher: IEEE Computer Society
Pages: 18-29
Number of pages: 12
Volume: 2015-May
ISBN (Print): 9781479979639
DOI: https://doi.org/10.1109/ICDE.2015.7113269
Publication status: Published - 26 May 2015
Event: 2015 31st IEEE International Conference on Data Engineering, ICDE 2015 - Seoul, Korea, Republic of
Duration: 13 Apr 2015 to 17 Apr 2015

Other

Other: 2015 31st IEEE International Conference on Data Engineering, ICDE 2015
Country: Korea, Republic of
City: Seoul
Period: 13/4/15 to 17/4/15

Fingerprint

Cleaning
Data acquisition
Repair

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Interlandi, M., & Tang, N. (2015). Proof positive and negative in data cleaning. In Proceedings - International Conference on Data Engineering (Vol. 2015-May, pp. 18-29). [7113269] IEEE Computer Society. https://doi.org/10.1109/ICDE.2015.7113269

Proof positive and negative in data cleaning. / Interlandi, Matteo; Tang, Nan.

Proceedings - International Conference on Data Engineering. Vol. 2015-May IEEE Computer Society, 2015. p. 18-29 7113269.

Interlandi, M & Tang, N 2015, Proof positive and negative in data cleaning. in Proceedings - International Conference on Data Engineering. vol. 2015-May, 7113269, IEEE Computer Society, pp. 18-29, 2015 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, Korea, Republic of, 13/4/15. https://doi.org/10.1109/ICDE.2015.7113269

Interlandi M, Tang N. Proof positive and negative in data cleaning. In Proceedings - International Conference on Data Engineering. Vol. 2015-May. IEEE Computer Society. 2015. p. 18-29. 7113269 https://doi.org/10.1109/ICDE.2015.7113269

Interlandi, Matteo ; Tang, Nan. / Proof positive and negative in data cleaning. Proceedings - International Conference on Data Engineering. Vol. 2015-May IEEE Computer Society, 2015. pp. 18-29
@inproceedings{0baf95d524fb42ba8b9bba5c1fbe9566,
title = "Proof positive and negative in data cleaning",
author = "Matteo Interlandi and Nan Tang",
year = "2015",
month = "5",
day = "26",
doi = "10.1109/ICDE.2015.7113269",
language = "English",
isbn = "9781479979639",
volume = "2015-May",
pages = "18--29",
booktitle = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Proof positive and negative in data cleaning

AU - Interlandi, Matteo

AU - Tang, Nan

PY - 2015/5/26

Y1 - 2015/5/26

UR - http://www.scopus.com/inward/record.url?scp=84940824296&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84940824296&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2015.7113269

DO - 10.1109/ICDE.2015.7113269

M3 - Conference contribution

AN - SCOPUS:84940824296

SN - 9781479979639

VL - 2015-May

SP - 18

EP - 29

BT - Proceedings - International Conference on Data Engineering

PB - IEEE Computer Society

ER -