Interaction between record matching and data repairing

Wenfei Fan, Shuai Ma, Nan Tang, Wenyuan Yu

Research output: Contribution to journal, Article

26 Citations (Scopus)

Abstract

Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing aims to make a database consistent by fixing errors in the data using integrity constraints. Current data cleaning systems typically treat these as separate processes, based on heuristic solutions. This article studies a new problem in connection with data cleaning, namely the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we provide a uniform framework that seamlessly unifies repairing and matching operations to clean a database based on integrity constraints, matching rules, and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination, and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analyses, respectively, which are more accurate than fixes generated by heuristics. Heuristic fixes are produced only when deterministic or reliable fixes are unavailable. We experimentally verify, using real-life and synthetic data, that our techniques can significantly improve accuracy over record matching and data repairing performed as separate processes.
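
The claim that repairing can help identify matches, and vice versa, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the attribute names, the sample tuples, the constant rule in `apply_cfd`, the matching rule in `apply_md`, and the fixpoint loop in `clean` are hypothetical and are not the algorithms or rules from the article. It shows only the general idea that a constraint-based repair may make a matching rule applicable, which in turn allows a further correction from master data.

```python
# Illustrative only: toy rules and data, not the algorithms from the article.

def apply_cfd(t):
    """Hypothetical constant rule: area_code == '131' implies city == 'Edi'."""
    if t["area_code"] == "131" and t["city"] != "Edi":
        return {**t, "city": "Edi"}
    return t

def apply_md(t, master):
    """Hypothetical matching rule: if phone and city agree with a master
    tuple, treat both as the same entity and copy the master's name."""
    for s in master:
        if t["phone"] == s["phone"] and t["city"] == s["city"]:
            return {**t, "name": s["name"]}
    return t

def clean(t, master):
    """Alternate repairing and matching until the tuple stops changing."""
    while True:
        updated = apply_md(apply_cfd(t), master)
        if updated == t:
            return t
        t = updated

master = [{"name": "Robert Brady", "city": "Edi",
           "area_code": "131", "phone": "6884563"}]
dirty = {"name": "Bob", "city": "Ldn",
         "area_code": "131", "phone": "6884563"}

# Matching alone finds nothing, because the erroneous city blocks the rule.
matched_only = apply_md(dirty, master)

# Repairing the city first makes the matching rule fire, which fixes the name.
cleaned = clean(dirty, master)
```

In this sketch, matching on its own leaves the tuple unchanged, while interleaving the two steps corrects both `city` and `name`. Real rule sets may conflict or fail to converge, which is why the article analyses termination and determinism; the loop here terminates only because the two toy rules do not undo each other.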

Original language: English
Article number: 16
Journal: Journal of Data and Information Quality
Volume: 4
Issue number: 4
DOI: 10.1145/2567657
Publication status: Published - 1 Jan 2014

Fingerprint

Cleaning
Entropy
Interaction
Data cleaning
Heuristics
Integrity
Data base

Keywords

  • Conditional functional dependency
  • Data repairing
  • Matching dependency
  • Record matching

ASJC Scopus subject areas

  • Information Systems and Management
  • Information Systems

Cite this

Interaction between record matching and data repairing. / Fan, Wenfei; Ma, Shuai; Tang, Nan; Yu, Wenyuan.

In: Journal of Data and Information Quality, Vol. 4, No. 4, 16, 01.01.2014.

Fan, Wenfei ; Ma, Shuai ; Tang, Nan ; Yu, Wenyuan. / Interaction between record matching and data repairing. In: Journal of Data and Information Quality. 2014 ; Vol. 4, No. 4.
@article{48dc2f04d4bd4dc3ba1b5b3fee4a2733,
title = "Interaction between record matching and data repairing",
keywords = "Conditional functional dependency, Data repairing, Matching dependency, Record matching",
author = "Wenfei Fan and Shuai Ma and Nan Tang and Wenyuan Yu",
year = "2014",
month = "1",
day = "1",
doi = "10.1145/2567657",
language = "English",
volume = "4",
journal = "Journal of Data and Information Quality",
issn = "1936-1955",
publisher = "Association for Computing Machinery (ACM)",
number = "4",

}

TY - JOUR

T1 - Interaction between record matching and data repairing

AU - Fan, Wenfei

AU - Ma, Shuai

AU - Tang, Nan

AU - Yu, Wenyuan

PY - 2014/1/1

Y1 - 2014/1/1

KW - Conditional functional dependency

KW - Data repairing

KW - Matching dependency

KW - Record matching

UR - http://www.scopus.com/inward/record.url?scp=84901490949&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84901490949&partnerID=8YFLogxK

U2 - 10.1145/2567657

DO - 10.1145/2567657

M3 - Article

VL - 4

JO - Journal of Data and Information Quality

JF - Journal of Data and Information Quality

SN - 1936-1955

IS - 4

M1 - 16

ER -