UGuide - User-guided discovery of FD-detectable errors

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

10 Citations (Scopus)

Abstract

Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors from dirty data. The broad intuition is that given a dirty dataset, it is feasible to automatically find approximate FDs, as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is, given a limited budget of expert's time, which questions we should ask, either FDs, cells, or tuples, such that we can find as many data errors as possible. We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors from dirty data.
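
To make the notion of an FD-detectable error concrete, the following minimal Python sketch checks one candidate FD against a small table. It is an illustration only, not the algorithm proposed in the paper. The function name `fd_violations`, the majority-vote heuristic used to pick suspect rows, and the sample `zip -> city` data are all assumptions introduced here. The reported error ratio is the fraction of rows that would have to be removed for the FD to hold exactly, which is one common way to score an approximate FD.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Score a candidate FD lhs -> rhs on a list of dict rows.

    Returns (error_ratio, suspect_row_indices). Hypothetical helper:
    within each group of rows sharing the same lhs values, rows whose
    rhs value differs from the group's most frequent value are flagged.
    """
    if not rows:
        return 0.0, []

    # Group row indices by their left-hand-side values.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)

    suspects = []
    for idxs in groups.values():
        counts = Counter(rows[i][rhs] for i in idxs)
        majority, _ = counts.most_common(1)[0]
        # Rows that disagree with the majority are possible errors.
        suspects.extend(i for i in idxs if rows[i][rhs] != majority)

    return len(suspects) / len(rows), sorted(suspects)


# Made-up sample data (assumption, not from the paper).
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]
error, suspects = fd_violations(rows, ["zip"], "city")
print(error, suspects)
```

On this sample the sketch flags the third row and reports an error ratio of 0.25. Whether that row is truly wrong, or the candidate FD itself is invalid, is the kind of question the abstract says only an expert can settle.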

Original language: English
Title of host publication: SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1385-1397
Number of pages: 13
Volume: Part F127746
ISBN (Electronic): 9781450341974
DOIs: https://doi.org/10.1145/3035918.3064024
Publication status: Published - 9 May 2017
Event: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017 - Chicago, United States
Duration: 14 May 2017 to 19 May 2017

Other

Other: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Country: United States
City: Chicago
Period: 14/5/17 to 19/5/17

Fingerprint

Error detection
Experiments

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Thirumuruganathan, S., Berti-Equille, L., Ouzzani, M., Quiane Ruiz, J. A., & Tang, N. (2017). UGuide - User-guided discovery of FD-detectable errors. In SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data (Vol. Part F127746, pp. 1385-1397). Association for Computing Machinery. https://doi.org/10.1145/3035918.3064024

UGuide - User-guided discovery of FD-detectable errors. / Thirumuruganathan, Saravanan; Berti-Equille, Laure; Ouzzani, Mourad; Quiane Ruiz, Jorge Arnulfo; Tang, Nan.

SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. Vol. Part F127746 Association for Computing Machinery, 2017. p. 1385-1397.

Thirumuruganathan, S, Berti-Equille, L, Ouzzani, M, Quiane Ruiz, JA & Tang, N 2017, UGuide - User-guided discovery of FD-detectable errors. in SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. vol. Part F127746, Association for Computing Machinery, pp. 1385-1397, 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017, Chicago, United States, 14/5/17. https://doi.org/10.1145/3035918.3064024
Thirumuruganathan S, Berti-Equille L, Ouzzani M, Quiane Ruiz JA, Tang N. UGuide - User-guided discovery of FD-detectable errors. In SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. Vol. Part F127746. Association for Computing Machinery. 2017. p. 1385-1397 https://doi.org/10.1145/3035918.3064024
Thirumuruganathan, Saravanan ; Berti-Equille, Laure ; Ouzzani, Mourad ; Quiane Ruiz, Jorge Arnulfo ; Tang, Nan. / UGuide - User-guided discovery of FD-detectable errors. SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. Vol. Part F127746 Association for Computing Machinery, 2017. pp. 1385-1397
@inproceedings{07f73c8f1e0d47298493ce4839aef392,
title = "UGuide - User-guided discovery of FD-detectable errors",
abstract = "Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors from dirty data. The broad intuition is that given a dirty dataset, it is feasible to automatically find approximate FDs, as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is, given a limited budget of expert's time, which questions we should ask, either FDs, cells, or tuples, such that we can find as many data errors as possible. We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors from dirty data.",
author = "Saravanan Thirumuruganathan and Laure Berti-Equille and Mourad Ouzzani and {Quiane Ruiz}, {Jorge Arnulfo} and Nan Tang",
year = "2017",
month = "5",
day = "9",
doi = "10.1145/3035918.3064024",
language = "English",
volume = "Part F127746",
pages = "1385--1397",
booktitle = "SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data",
publisher = "Association for Computing Machinery",
}

TY - GEN

T1 - UGuide - User-guided discovery of FD-detectable errors

AU - Thirumuruganathan, Saravanan

AU - Berti-Equille, Laure

AU - Ouzzani, Mourad

AU - Quiane Ruiz, Jorge Arnulfo

AU - Tang, Nan

PY - 2017/5/9

Y1 - 2017/5/9

AB - Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors from dirty data. The broad intuition is that given a dirty dataset, it is feasible to automatically find approximate FDs, as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is, given a limited budget of expert's time, which questions we should ask, either FDs, cells, or tuples, such that we can find as many data errors as possible. We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors from dirty data.

UR - http://www.scopus.com/inward/record.url?scp=85021204207&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85021204207&partnerID=8YFLogxK

U2 - 10.1145/3035918.3064024

DO - 10.1145/3035918.3064024

M3 - Conference contribution

VL - Part F127746

SP - 1385

EP - 1397

BT - SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data

PB - Association for Computing Machinery

ER -