A novel cost-based model for data repairing (Extended abstract)

Shuang Hao, Nan Tang, Guoliang Li, Jian He, Na Ta, Jianhua Feng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Integrity constraint (IC) based data repairing is typically an iterative process consisting of two parts: detecting and grouping errors that violate the given ICs, and modifying values inside each group so that the modified database satisfies those ICs. However, most existing automatic solutions handle the detection and grouping of errors in a straightforward way (e.g., finding violations of functional dependencies (FDs) using string equality) and focus instead on heuristics for modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarities, in contrast to traditional methods that detect errors syntactically using string equality. Along with the revised semantics, we also propose a new cost model that quantifies the cost of data repairing by considering distances between strings. We show that the revised semantics significantly improves the detection and grouping of errors, which in turn improves both the precision and the recall of the subsequent data repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single FD. We devise efficient algorithms to find approximate repairs.
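
The contrast between equality-based and similarity-based detection can be illustrated with a minimal sketch. It is not the algorithm from the paper: the function names (`edit_distance`, `violations`, `repair_cost`), the threshold `tau`, the choice of Levenshtein distance, and the sample rows are all assumptions made for illustration. A pair of rows is flagged for an FD `lhs -> rhs` when the `lhs` values are within `tau` edits and the `rhs` values differ; setting `tau = 0` recovers plain string equality. The cost of a repair is taken here as the summed edit distance between original and repaired cells.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def violations(rows, lhs, rhs, tau):
    """Index pairs whose lhs values are within tau edits but whose rhs differ.

    Assumed semantics for illustration only; tau = 0 is string equality.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if edit_distance(rows[i][lhs], rows[j][lhs]) <= tau
        and rows[i][rhs] != rows[j][rhs]
    ]


def repair_cost(original, repaired):
    """Sum of edit distances over all cells (assumed cost function)."""
    return sum(
        edit_distance(o[k], r[k])
        for o, r in zip(original, repaired)
        for k in o
    )


# Hypothetical table with the FD city -> state.
rows = [
    {"city": "Boston", "state": "MA"},
    {"city": "Bostn", "state": "MA"},
    {"city": "Boston", "state": "NA"},
    {"city": "Chicago", "state": "IL"},
]

exact = violations(rows, "city", "state", tau=0)
fuzzy = violations(rows, "city", "state", tau=1)

# One candidate repair: change the state of row 2 from "NA" to "MA".
repaired = [dict(r) for r in rows]
repaired[2]["state"] = "MA"
cost = repair_cost(rows, repaired)
```

With equality (`tau=0`) only rows 0 and 2 are paired; with `tau=1` the misspelled "Bostn" row joins the same group, so rows 1 and 2 are also flagged. This is the kind of additional grouping that a similarity-based semantics makes possible.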

Original language: English
Title of host publication: Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
Publisher: IEEE Computer Society
Pages: 49-50
Number of pages: 2
ISBN (Electronic): 9781509065431
DOIs: https://doi.org/10.1109/ICDE.2017.31
Publication status: Published - 16 May 2017
Event: 33rd IEEE International Conference on Data Engineering, ICDE 2017 - San Diego, United States
Duration: 19 Apr 2017 to 22 Apr 2017

Other

Other: 33rd IEEE International Conference on Data Engineering, ICDE 2017
Country: United States
City: San Diego
Period: 19/4/17 to 22/4/17

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Hao, S., Tang, N., Li, G., He, J., Ta, N., & Feng, J. (2017). A novel cost-based model for data repairing (Extended abstract). In Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017 (pp. 49-50). [7929927] IEEE Computer Society. https://doi.org/10.1109/ICDE.2017.31