Interaction between record matching and data repairing

Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Wenyuan Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

65 Citations (Scopus)

Abstract

Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using constraints. These are treated as separate processes in current data cleaning systems, based on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP- or coNP-complete, to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. We experimentally verify that our techniques significantly improve the accuracy of record matching and data repairing taken as separate processes, using real-life data.
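The claim that repairing can help identify matches can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the attribute names, the single CFD-style rule (area code 131 implies city "Edi"), the single MD-style rule (equal phone and city identify a tuple with a master tuple), and the sample data are all hypothetical, and the code is not the authors' algorithm.

```python
# Toy illustration only: rules, attribute names, and data are hypothetical.

# Master data: reference tuples assumed to be clean.
MASTER = [
    {"name": "Robert Brady", "phone": "3887644", "area": "131",
     "city": "Edi", "zip": "EH8 9AB"},
]


def repair_with_cfd(t):
    """Apply a hypothetical CFD-style rule: area code 131 implies city 'Edi'."""
    fixed = dict(t)
    if fixed["area"] == "131" and fixed["city"] != "Edi":
        fixed["city"] = "Edi"
    return fixed


def match_with_md(t, master):
    """Apply a hypothetical MD-style rule: if phone and city agree with a
    master tuple, treat the two as the same entity and copy the master zip."""
    for m in master:
        if t["phone"] == m["phone"] and t["city"] == m["city"]:
            fixed = dict(t)
            fixed["zip"] = m["zip"]
            return fixed, m
    return t, None


dirty = {"name": "Bob Brady", "phone": "3887644", "area": "131",
         "city": "Ldn", "zip": ""}

# Matching alone finds nothing, because the erroneous city blocks the rule.
_, match_before = match_with_md(dirty, MASTER)

# Repairing the city first lets the matching rule fire, and the match in
# turn supplies a value for the missing zip.
repaired = repair_with_cfd(dirty)
cleaned, match_after = match_with_md(repaired, MASTER)
```

In this sketch the repair enables the match, and the match then enables a further repair (filling in the zip), which is the kind of interleaving the abstract describes.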

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Pages: 469-480
Number of pages: 12
DOI: https://doi.org/10.1145/1989323.1989373
Publication status: Published - 11 Jul 2011
Externally published: Yes
Event: 2011 ACM SIGMOD and 30th PODS 2011 Conference - Athens, Greece
Duration: 12 Jun 2011 to 16 Jun 2011

Keywords

  • conditional functional dependency
  • data cleaning
  • matching dependency

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Fan, W., Li, J., Ma, S., Tang, N., & Yu, W. (2011). Interaction between record matching and data repairing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 469-480). https://doi.org/10.1145/1989323.1989373

@inproceedings{221b5ed0e4974271abd089d345bfb08f,
title = "Interaction between record matching and data repairing",
keywords = "conditional functional dependency, data cleaning, matching dependency",
author = "Wenfei Fan and Jianzhong Li and Shuai Ma and Nan Tang and Wenyuan Yu",
year = "2011",
month = "7",
day = "11",
doi = "10.1145/1989323.1989373",
language = "English",
isbn = "9781450306614",
pages = "469--480",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}
