UGuide - User-guided discovery of FD-detectable errors

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors from dirty data. The broad intuition is that given a dirty dataset, it is feasible to automatically find approximate FDs, as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is, given a limited budget of expert's time, which questions we should ask, either FDs, cells, or tuples, such that we can find as many data errors as possible. We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors from dirty data.
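As a rough illustration of the abstract's starting point (approximate FDs and possibly erroneous data can both be found automatically), the sketch below scores one candidate FD on a small table and lists the rows that disagree with the majority value in their group. It is not the UGuide algorithm. The function name `fd_violations`, the majority-vote heuristic, and the sample `zip`/`city` data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Score the candidate FD lhs -> rhs on a list of dict rows.

    Returns (error, suspects):
      error    - fraction of rows that would have to be removed for the
                 FD to hold exactly (0.0 means the FD holds as given)
      suspects - sorted indices of rows whose rhs value differs from the
                 most common rhs value in their lhs group
    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Group row indices by their left-hand-side values.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in lhs)].append(i)

    suspects = []
    for idxs in groups.values():
        counts = Counter(rows[i][rhs] for i in idxs)
        majority, _ = counts.most_common(1)[0]
        # Rows that disagree with the majority are candidate errors.
        suspects.extend(i for i in idxs if rows[i][rhs] != majority)

    error = len(suspects) / len(rows) if rows else 0.0
    return error, sorted(suspects)


# Made-up sample data: one misspelled city breaks zip -> city.
rows = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicgo"},
    {"zip": "10001", "city": "New York"},
]
error, suspects = fd_violations(rows, ("zip",), "city")
```

A low but non-zero `error` marks an FD that almost holds, and only an expert can say whether the FD is true or the flagged rows are wrong. That is the kind of question the abstract proposes to spend a limited expert budget on.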

Original language: English
Title of host publication: SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1385-1397
Number of pages: 13
Volume: Part F127746
ISBN (Electronic): 9781450341974
DOIs: https://doi.org/10.1145/3035918.3064024
Publication status: Published - 9 May 2017
Event: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017 - Chicago, United States
Duration: 14 May 2017 → 19 May 2017

Other

Other: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Country: United States
City: Chicago
Period: 14/5/17 → 19/5/17

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Thirumuruganathan, S., Berti-Equille, L., Ouzzani, M., Quiane Ruiz, J. A., & Tang, N. (2017). UGuide - User-guided discovery of FD-detectable errors. In SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data (Vol. Part F127746, pp. 1385-1397). Association for Computing Machinery. https://doi.org/10.1145/3035918.3064024