Incremental detection of inconsistencies in distributed data

Wenfei Fan, Jianzhong Li, Nan Tang, Wenyuan Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

24 Citations (Scopus)

Abstract

This paper investigates the problem of incremental detection of errors in distributed data. Given a distributed database D, a set Σ of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. The need for the study is evident, since real-life data is often dirty, distributed, and frequently updated, and it is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for D partitioned either vertically or horizontally, even when Σ and D are fixed. Nevertheless, we show that the problem is bounded: there exist algorithms to detect errors whose computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data and show that they are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts even when ΔV is reasonably large.
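To make the problem statement concrete, the following is a minimal single-site sketch of incremental violation detection for one CFD. It is not the paper's algorithm: it ignores partitioning and data shipment entirely, and it handles only a CFD whose right-hand side is a wildcard. The class name `IncrementalCFDChecker`, its methods, and the sample attributes (`cc`, `zip`, `street`) are hypothetical names chosen for illustration. The point it illustrates is boundedness: each update touches only the group of tuples that share the updated tuple's left-hand-side values, and the work done is proportional to the update plus the reported change ΔV.

```python
from collections import defaultdict


class IncrementalCFDChecker:
    """Tracks violations of one CFD (X -> A, pattern) on a single site.

    Assumption: the pattern places constants on some attributes and leaves
    the right-hand side A as a wildcard. Tuples that match the pattern and
    agree on X must then agree on A.
    """

    def __init__(self, lhs, rhs, pattern):
        self.lhs = tuple(lhs)         # attributes X
        self.rhs = rhs                # attribute A
        self.pattern = dict(pattern)  # attribute -> required constant
        # X-values -> A-value -> ids of tuples carrying that A-value
        self.groups = defaultdict(lambda: defaultdict(set))

    def _key(self, row):
        """Return the X-values of row, or None if row misses the pattern."""
        if any(row[a] != c for a, c in self.pattern.items()):
            return None
        return tuple(row[a] for a in self.lhs)

    def insert(self, tid, row):
        """Insert a tuple; return the ids newly added to V."""
        key = self._key(row)
        if key is None:
            return set()
        group = self.groups[key]
        was_violating = len(group) > 1
        group[row[self.rhs]].add(tid)
        if len(group) < 2:
            return set()              # group still agrees on A
        if was_violating:
            return {tid}              # only the new tuple joins V
        # The group just became inconsistent: every member joins V.
        return set().union(*group.values())

    def delete(self, tid, row):
        """Delete a tuple; return the ids that leave V."""
        key = self._key(row)
        if key is None or key not in self.groups:
            return set()
        group = self.groups[key]
        was_violating = len(group) > 1
        bucket = group.get(row[self.rhs], set())
        if tid not in bucket:
            return set()
        bucket.discard(tid)
        if not bucket:
            del group[row[self.rhs]]
        if not group:
            del self.groups[key]
        if not was_violating:
            return set()              # the tuple was never in V
        removed = {tid}
        if len(group) < 2:
            # The group became consistent: the remaining members leave V.
            removed |= set().union(*group.values())
        return removed


# Hypothetical CFD: for country code 44, zip determines street.
checker = IncrementalCFDChecker(("cc", "zip"), "street", {"cc": "44"})
t1 = {"cc": "44", "zip": "EH8", "street": "Mayfield"}
t2 = {"cc": "44", "zip": "EH8", "street": "Crichton"}
added_first = checker.insert(1, t1)   # empty set: no conflict yet
added_second = checker.insert(2, t2)  # {1, 2}: both tuples now violate
removed = checker.delete(2, t2)       # {1, 2}: the conflict disappears
```

In a distributed setting the grouping index would itself be spread across sites, which is where the data-shipment question studied in the paper arises; this sketch does not attempt to model that.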

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 318-329
Number of pages: 12
DOIs: https://doi.org/10.1109/ICDE.2012.82
Publication status: Published - 2012
Event: IEEE 28th International Conference on Data Engineering, ICDE 2012 - Arlington, VA, United States
Duration: 1 Apr 2012 to 5 Apr 2012

Fingerprint

Computational complexity
Costs

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Fan, W., Li, J., Tang, N., & Yu, W. (2012). Incremental detection of inconsistencies in distributed data. In Proceedings - International Conference on Data Engineering (pp. 318-329). [6228094] https://doi.org/10.1109/ICDE.2012.82

@inproceedings{4545355d93164e99bc65b8ffe0200749,
title = "Incremental detection of inconsistencies in distributed data",
author = "Wenfei Fan and Jianzhong Li and Nan Tang and Wenyuan Yu",
year = "2012",
doi = "10.1109/ICDE.2012.82",
language = "English",
pages = "318--329",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN
T1 - Incremental detection of inconsistencies in distributed data
AU - Fan, Wenfei
AU - Li, Jianzhong
AU - Tang, Nan
AU - Yu, Wenyuan
PY - 2012
Y1 - 2012
UR - http://www.scopus.com/inward/record.url?scp=84864198280&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84864198280&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2012.82
DO - 10.1109/ICDE.2012.82
M3 - Conference contribution
AN - SCOPUS:84864198280
SP - 318
EP - 329
BT - Proceedings - International Conference on Data Engineering
ER -