A Novel Cost-Based Model for Data Repairing

Shuang Hao, Nan Tang, Guoliang Li, Jian He, Na Ta, Jianhua Feng

Research output: Contribution to journal (Article)

5 Citations (Scopus)

Abstract

Data repairing based on integrity constraints (ICs) is an iterative process with two parts: detecting and grouping errors that violate the given ICs, and modifying values inside each group so that the modified database satisfies those ICs. However, most existing automatic solutions handle error detection and grouping in a straightforward way (e.g., finding violations of functional dependencies (FDs) by string equality) and devote most of their attention to heuristics for modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarity, in contrast to traditional methods that detect errors syntactically by string equality. Along with the revised semantics, we propose a new cost model that quantifies the cost of a data repair by the distances between strings. We show that the revised semantics substantially improves the detection and grouping of errors, which in turn improves both the precision and the recall of the subsequent repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single FD, and we devise efficient algorithms to find approximate repairs. In addition, we develop indices and optimization techniques to improve efficiency. Experiments show that our approach significantly outperforms existing automatic repair algorithms in both precision and recall.
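
The contrast between equality-based and similarity-based violation detection can be illustrated with a minimal sketch. This is not the paper's definition or algorithm: the functions `edit_distance`, `violations`, and `repair_cost`, the threshold `tau`, and the toy rows are all hypothetical, chosen only to show how a typo on the left-hand side of an FD hides a violation under string equality, and how a repair cost can be measured as a sum of string distances.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def violations(rows, lhs, rhs, tau=0):
    """Index pairs violating the FD lhs -> rhs (illustrative only).

    Two rows conflict when their lhs values are within `tau` edits of
    each other but their rhs values differ. tau=0 is plain equality.
    """
    found = []
    for (i, r), (j, s) in combinations(enumerate(rows), 2):
        if edit_distance(r[lhs], s[lhs]) <= tau and r[rhs] != s[rhs]:
            found.append((i, j))
    return found


def repair_cost(original, repaired, attrs):
    """Assumed cost: total edit distance between old and new values."""
    return sum(
        edit_distance(old[a], new[a])
        for old, new in zip(original, repaired)
        for a in attrs
    )


# Toy relation with the FD city -> state; row 1 has a typo in city.
rows = [
    {"city": "Boston", "state": "MA"},
    {"city": "Bostn", "state": "NY"},
    {"city": "Chicago", "state": "IL"},
]

print(violations(rows, "city", "state", tau=0))  # equality: nothing found
print(violations(rows, "city", "state", tau=1))  # similarity: rows 0 and 1

repaired = [dict(rows[0]), {"city": "Boston", "state": "MA"}, dict(rows[2])]
print(repair_cost(rows, repaired, ["city", "state"]))
```

With equality (`tau=0`) the misspelled row is never compared against its correct counterpart, so no violation is reported. With a tolerance of one edit the two rows are grouped, and the candidate repair shown costs three character edits under this assumed cost function.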

Original language: English
Article number: 7779087
Pages (from-to): 727-742
Number of pages: 16
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 29
Issue number: 4
DOI: 10.1109/TKDE.2016.2637928
Publication status: Published - 1 Apr 2017

Fingerprint

Repair
Semantics
Costs
Error detection
Syntactics
Experiments

Keywords

  • Data repairing
  • fault-tolerant violation
  • functional dependencies
  • graph model
  • maximal independent set

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

A Novel Cost-Based Model for Data Repairing. / Hao, Shuang; Tang, Nan; Li, Guoliang; He, Jian; Ta, Na; Feng, Jianhua.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 29, No. 4, 7779087, 01.04.2017, p. 727-742.

Hao, Shuang ; Tang, Nan ; Li, Guoliang ; He, Jian ; Ta, Na ; Feng, Jianhua. / A Novel Cost-Based Model for Data Repairing. In: IEEE Transactions on Knowledge and Data Engineering. 2017 ; Vol. 29, No. 4. pp. 727-742.
@article{b9de6308a2cd44238dacd100568ba414,
title = "A Novel Cost-Based Model for Data Repairing",
keywords = "Data repairing, fault-tolerant violation, functional dependencies, graph model, maximal independent set",
author = "Shuang Hao and Nan Tang and Guoliang Li and Jian He and Na Ta and Jianhua Feng",
year = "2017",
month = "4",
day = "1",
doi = "10.1109/TKDE.2016.2637928",
language = "English",
volume = "29",
pages = "727--742",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "4",

}

TY - JOUR

T1 - A Novel Cost-Based Model for Data Repairing

AU - Hao, Shuang

AU - Tang, Nan

AU - Li, Guoliang

AU - He, Jian

AU - Ta, Na

AU - Feng, Jianhua

PY - 2017/4/1

Y1 - 2017/4/1

KW - Data repairing

KW - fault-tolerant violation

KW - functional dependencies

KW - graph model

KW - maximal independent set

UR - http://www.scopus.com/inward/record.url?scp=85015972188&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85015972188&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2016.2637928

DO - 10.1109/TKDE.2016.2637928

M3 - Article

VL - 29

SP - 727

EP - 742

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 4

M1 - 7779087

ER -