Interactive and deterministic data cleaning: A tossed stone raises a thousand ripples

Jian He, Enzo Veltri, Donatello Santoro, Guoliang Li, Giansalvatore Mecca, Paolo Papotti, Nan Tang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

26 Citations (Scopus)

Abstract

We present Falcon, an interactive, deterministic, and declarative data cleaning system that uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper is finding a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, Falcon navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides using traditional one-hop traversal algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms that let Falcon dive through the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. We have conducted extensive experiments using both real-world and synthetic datasets to show that Falcon can interact effectively with users to repair data.
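
The idea of generalizing one user update into a lattice of candidate SQL update queries can be illustrated with a small sketch. The code below is not Falcon's implementation: the table `customers`, its columns, the helper names, and the simulated user are all hypothetical, and the binary search along a single chain is only one simple way to skip several lattice levels per question. It assumes validity is monotone, meaning that adding conditions to a valid query keeps it valid.

```python
from itertools import combinations


def quote(value):
    """Render a value as a SQL string literal (illustration only;
    real code should use bound parameters)."""
    return "'" + str(value).replace("'", "''") + "'"


def lattice_nodes(row, target):
    """Every subset of the non-target attributes is one candidate WHERE clause.
    Ordered by set inclusion, these subsets form a lattice."""
    attrs = sorted(a for a in row if a != target)
    return [frozenset(c) for k in range(len(attrs) + 1)
            for c in combinations(attrs, k)]


def to_sql(table, target, new_value, row, node):
    """Turn a lattice node into an UPDATE query built from the seed row's values."""
    sql = f"UPDATE {table} SET {target} = {quote(new_value)}"
    where = " AND ".join(f"{a} = {quote(row[a])}" for a in sorted(node))
    return f"{sql} WHERE {where}" if where else sql


def affected(rows, row, node):
    """Indexes of the rows that the query for this node would update."""
    return [i for i, r in enumerate(rows) if all(r[a] == row[a] for a in node)]


def most_general_valid(chain, is_valid):
    """Binary search along a chain ordered from general to specific.
    Assumes the last (most specific) node is valid and validity is monotone."""
    lo, hi, asked = 0, len(chain) - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        asked += 1                      # one question to the user
        if is_valid(chain[mid]):
            hi = mid                    # a more general query may also be valid
        else:
            lo = mid + 1                # need more conditions
    return chain[lo], asked


# Hypothetical data: every NYC customer wrongly has state 'NJ'.
rows = [
    {"name": "Ann", "zip": "10001", "city": "NYC", "state": "NJ"},
    {"name": "Bob", "zip": "10001", "city": "NYC", "state": "NJ"},
    {"name": "Cat", "zip": "07302", "city": "Jersey City", "state": "NJ"},
    {"name": "Dan", "zip": "10002", "city": "NYC", "state": "NJ"},
]
seed = rows[0]                          # the user fixes this row: state = 'NY'
nodes = lattice_nodes(seed, "state")    # 2^3 candidate WHERE clauses

# One chain through the lattice: {}, {city}, {city, zip}, {city, zip, name}.
order = ["city", "zip", "name"]
chain = [frozenset(order[:k]) for k in range(len(order) + 1)]


def simulated_user(node):
    """Stand-in for the user: a query is valid if it only touches NYC rows."""
    return all(rows[i]["city"] == "NYC" for i in affected(rows, seed, node))


best, asked = most_general_valid(chain, simulated_user)
sql = to_sql("customers", "state", "NY", seed, best)
print(sql, affected(rows, seed, best), asked)
```

In this toy setting, two questions are enough to settle on the query that conditions only on `city`, which repairs three rows from a single user update. A one-hop traversal would instead examine neighbouring nodes one level at a time.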

Original language: English
Title of host publication: SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 893-907
Number of pages: 15
Volume: 26-June-2016
ISBN (Electronic): 9781450335317
DOIs: https://doi.org/10.1145/2882903.2915242
Publication status: Published - 26 Jun 2016
Event: 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 - San Francisco, United States
Duration: 26 Jun 2016 to 1 Jul 2016

Fingerprint

Cleaning
Repair
Experiments

Keywords

  • Data cleaning
  • Declarative
  • Deterministic
  • Interactive

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., & Tang, N. (2016). Interactive and deterministic data cleaning: A tossed stone raises a thousand ripples. In SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data (Vol. 26-June-2016, pp. 893-907). Association for Computing Machinery. https://doi.org/10.1145/2882903.2915242

@inproceedings{d93d3b5268994081832f3a01bcb1ceb6,
title = "Interactive and deterministic data cleaning: A tossed stone raises a thousand ripples",
keywords = "Data cleaning, Declarative, Deterministic, Interactive",
author = "Jian He and Enzo Veltri and Donatello Santoro and Guoliang Li and Giansalvatore Mecca and Paolo Papotti and Nan Tang",
year = "2016",
month = "6",
day = "26",
doi = "10.1145/2882903.2915242",
language = "English",
volume = "26-June-2016",
pages = "893--907",
booktitle = "SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Interactive and deterministic data cleaning

T2 - A tossed stone raises a thousand ripples

AU - He, Jian

AU - Veltri, Enzo

AU - Santoro, Donatello

AU - Li, Guoliang

AU - Mecca, Giansalvatore

AU - Papotti, Paolo

AU - Tang, Nan

PY - 2016/6/26

Y1 - 2016/6/26

KW - Data cleaning

KW - Declarative

KW - Deterministic

KW - Interactive

UR - http://www.scopus.com/inward/record.url?scp=84979711032&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84979711032&partnerID=8YFLogxK

U2 - 10.1145/2882903.2915242

DO - 10.1145/2882903.2915242

M3 - Conference contribution

AN - SCOPUS:84979711032

VL - 26-June-2016

SP - 893

EP - 907

BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data

PB - Association for Computing Machinery

ER -