Messing up with BART: Error generation for evaluating data-cleaning algorithms

Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, Donatello Santoro

Research output: Chapter in Book/Report/Conference proceedingChapter

22 Citations (Scopus)

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that srifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

Original languageEnglish
Title of host publicationProceedings of the VLDB Endowment
PublisherAssociation for Computing Machinery
Pages36-47
Number of pages12
Volume9
Edition2
Publication statusPublished - 2016
Event42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India
Duration: 5 Sep 20169 Sep 2016

Other

Other42nd International Conference on Very Large Data Bases, VLDB 2016
CountryIndia
CityDelhi
Period5/9/169/9/16

    Fingerprint

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science(all)

Cite this

Arocena, P. C., Glavic, B., Mecca, G., Miller, R. J., Papotti, P., & Santoro, D. (2016). Messing up with BART: Error generation for evaluating data-cleaning algorithms. In Proceedings of the VLDB Endowment (2 ed., Vol. 9, pp. 36-47). Association for Computing Machinery.