Messing up with BART

Error generation for evaluating data-cleaning algorithms

Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, Donatello Santoro

Research output: Chapter in Book/Report/Conference proceeding > Chapter

18 Citations (Scopus)

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 36-47
Number of pages: 12
Volume: 9
Edition: 2
Publication status: Published - 2016
Event: 42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India
Duration: 5 Sep 2016 to 9 Sep 2016

Other

Other: 42nd International Conference on Very Large Data Bases, VLDB 2016
Country: India
City: Delhi
Period: 5/9/16 to 9/9/16

Fingerprint

Cleaning
Benchmarking
Scalability

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Arocena, P. C., Glavic, B., Mecca, G., Miller, R. J., Papotti, P., & Santoro, D. (2016). Messing up with BART: Error generation for evaluating data-cleaning algorithms. In Proceedings of the VLDB Endowment (2 ed., Vol. 9, pp. 36-47). Association for Computing Machinery.

Messing up with BART : Error generation for evaluating data-cleaning algorithms. / Arocena, Patricia C.; Glavic, Boris; Mecca, Giansalvatore; Miller, Renée J.; Papotti, Paolo; Santoro, Donatello.

Proceedings of the VLDB Endowment. Vol. 9 2. ed. Association for Computing Machinery, 2016. p. 36-47.

Arocena, PC, Glavic, B, Mecca, G, Miller, RJ, Papotti, P & Santoro, D 2016, Messing up with BART: Error generation for evaluating data-cleaning algorithms. in Proceedings of the VLDB Endowment. 2 edn, vol. 9, Association for Computing Machinery, pp. 36-47, 42nd International Conference on Very Large Data Bases, VLDB 2016, Delhi, India, 5/9/16.
Arocena PC, Glavic B, Mecca G, Miller RJ, Papotti P, Santoro D. Messing up with BART: Error generation for evaluating data-cleaning algorithms. In Proceedings of the VLDB Endowment. 2 ed. Vol. 9. Association for Computing Machinery. 2016. p. 36-47
Arocena, Patricia C. ; Glavic, Boris ; Mecca, Giansalvatore ; Miller, Renée J. ; Papotti, Paolo ; Santoro, Donatello. / Messing up with BART : Error generation for evaluating data-cleaning algorithms. Proceedings of the VLDB Endowment. Vol. 9 2. ed. Association for Computing Machinery, 2016. pp. 36-47
@inbook{bc4975ccf520434ab217844b6d9bb1a3,
title = "Messing up with BART: Error generation for evaluating data-cleaning algorithms",
abstract = "We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.",
author = "Arocena, {Patricia C.} and Boris Glavic and Giansalvatore Mecca and Miller, {Ren{\'e}e J.} and Paolo Papotti and Donatello Santoro",
year = "2016",
language = "English",
volume = "9",
pages = "36--47",
booktitle = "Proceedings of the VLDB Endowment",
publisher = "Association for Computing Machinery",
edition = "2",

}

TY - CHAP

T1 - Messing up with BART

T2 - Error generation for evaluating data-cleaning algorithms

AU - Arocena, Patricia C.

AU - Glavic, Boris

AU - Mecca, Giansalvatore

AU - Miller, Renée J.

AU - Papotti, Paolo

AU - Santoro, Donatello

PY - 2016

Y1 - 2016

N2 - We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

AB - We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

UR - http://www.scopus.com/inward/record.url?scp=84975824359&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84975824359&partnerID=8YFLogxK

M3 - Chapter

VL - 9

SP - 36

EP - 47

BT - Proceedings of the VLDB Endowment

PB - Association for Computing Machinery

ER -