Incremental detection of inconsistencies in distributed data

Wenfei Fan, Jianzhong Li, Nan Tang, Wenyuan Yu

Research output: Contribution to journal (Article)

23 Citations (Scopus)

Abstract

This paper investigates incremental detection of errors in distributed data. Given a distributed database \(D\), a set \(\Sigma\) of conditional functional dependencies (CFDs), the set \(\mathsf{V}\) of violations of the CFDs in \(D\), and updates \(\Delta D\) to \(D\), the problem is to find, with minimum data shipment, the changes \(\Delta\mathsf{V}\) to \(\mathsf{V}\) in response to \(\Delta D\). The need for the study is evident since real-life data is often dirty, distributed and frequently updated. It is often prohibitively expensive to recompute the entire set of violations when \(D\) is updated. We show that the incremental detection problem is NP-complete for a database \(D\) that is partitioned either vertically or horizontally, even when \(\Sigma\) and \(D\) are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of \(\Delta D\) and \(\Delta\mathsf{V}\), independent of the size of the database \(D\). We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.
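
To make the notion of a bounded incremental computation concrete, the following is a minimal single-site sketch in Python. It is not the paper's algorithm: it ignores data distribution and data shipment entirely, handles only one CFD with a variable right-hand side, and the class name `FDViolationIndex`, its parameters, and the sample attributes are all hypothetical. It only illustrates how a hash index over the left-hand-side values lets each update return the change to the violation set in time proportional to that change, rather than to the size of the database.

```python
from collections import defaultdict


class FDViolationIndex:
    """Illustrative index for one CFD (X -> Y, with optional constants on X).

    Hypothetical helper, not taken from the paper. Tuples that agree on X
    but disagree on Y are all treated as violations.
    """

    def __init__(self, lhs, rhs, condition=None):
        self.lhs = tuple(lhs)
        self.rhs = rhs
        # Constant pattern: the CFD applies only to tuples matching it.
        self.condition = dict(condition or {})
        # X-value -> (Y-value -> set of tuple ids)
        self.groups = defaultdict(lambda: defaultdict(set))

    def _applies(self, row):
        return all(row[a] == v for a, v in self.condition.items())

    def _key(self, row):
        return tuple(row[a] for a in self.lhs)

    def insert(self, tid, row):
        """Add a tuple; return the tuple ids that become violations."""
        if not self._applies(row):
            return set()
        group = self.groups[self._key(row)]
        was_violating = len(group) > 1
        group[row[self.rhs]].add(tid)
        if was_violating:
            # The group was already inconsistent: only the new tuple is added.
            return {tid}
        if len(group) > 1:
            # The new tuple introduced a second Y-value: the whole group turns bad.
            return {t for ids in group.values() for t in ids}
        return set()

    def delete(self, tid, row):
        """Remove a previously inserted tuple; return ids that stop being violations."""
        if not self._applies(row):
            return set()
        key = self._key(row)
        group = self.groups[key]
        was_violating = len(group) > 1
        ids = group[row[self.rhs]]
        ids.discard(tid)
        if not ids:
            del group[row[self.rhs]]
        if not group:
            del self.groups[key]
        if not was_violating:
            return set()
        removed = {tid}
        if len(group) == 1:
            # Only one Y-value is left, so the remaining tuples are consistent again.
            removed |= next(iter(group.values()))
        return removed
```

Each call touches a single group and does work proportional to the number of tuple ids it returns, plus a constant. A distributed version would additionally have to decide which sites exchange which index entries, which is the part the paper addresses and this sketch leaves out.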

Original language: English
Article number: 6243140
Pages (from-to): 1367-1383
Number of pages: 17
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 26
Issue number: 6
DOI: 10.1109/TKDE.2012.138
Publication status: Published - 1 Jan 2014

Fingerprint

Computational complexity
Costs

Keywords

  • Data
  • Data dependencies
  • General
  • Miscellaneous

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Information Systems
  • Computer Science Applications

Cite this

Incremental detection of inconsistencies in distributed data. / Fan, Wenfei; Li, Jianzhong; Tang, Nan; Yu, Wenyuan.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 6, 6243140, 01.01.2014, p. 1367-1383.

Fan, Wenfei ; Li, Jianzhong ; Tang, Nan ; Yu, Wenyuan. / Incremental detection of inconsistencies in distributed data. In: IEEE Transactions on Knowledge and Data Engineering. 2014 ; Vol. 26, No. 6. pp. 1367-1383.
@article{3eb4e7e0a5cc417288f87403d894dfb5,
title = "Incremental detection of inconsistencies in distributed data",
abstract = "This paper investigates incremental detection of errors in distributed data. Given a distributed database \(D\), a set \(\Sigma\) of conditional functional dependencies (CFDs), the set \(\mathsf{V}\) of violations of the CFDs in \(D\), and updates \(\Delta D\) to \(D\), the problem is to find, with minimum data shipment, the changes \(\Delta\mathsf{V}\) to \(\mathsf{V}\) in response to \(\Delta D\). The need for the study is evident since real-life data is often dirty, distributed and frequently updated. It is often prohibitively expensive to recompute the entire set of violations when \(D\) is updated. We show that the incremental detection problem is NP-complete for a database \(D\) that is partitioned either vertically or horizontally, even when \(\Sigma\) and \(D\) are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of \(\Delta D\) and \(\Delta\mathsf{V}\), independent of the size of the database \(D\). We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.",
keywords = "Data, Data dependencies, General, Miscellaneous",
author = "Wenfei Fan and Jianzhong Li and Nan Tang and Wenyuan Yu",
year = "2014",
month = "1",
day = "1",
doi = "10.1109/TKDE.2012.138",
language = "English",
volume = "26",
pages = "1367--1383",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "6",

}

TY - JOUR

T1 - Incremental detection of inconsistencies in distributed data

AU - Fan, Wenfei

AU - Li, Jianzhong

AU - Tang, Nan

AU - Yu, Wenyuan

PY - 2014/1/1

Y1 - 2014/1/1

AB - This paper investigates incremental detection of errors in distributed data. Given a distributed database \(D\), a set \(\Sigma\) of conditional functional dependencies (CFDs), the set \(\mathsf{V}\) of violations of the CFDs in \(D\), and updates \(\Delta D\) to \(D\), the problem is to find, with minimum data shipment, the changes \(\Delta\mathsf{V}\) to \(\mathsf{V}\) in response to \(\Delta D\). The need for the study is evident since real-life data is often dirty, distributed and frequently updated. It is often prohibitively expensive to recompute the entire set of violations when \(D\) is updated. We show that the incremental detection problem is NP-complete for a database \(D\) that is partitioned either vertically or horizontally, even when \(\Sigma\) and \(D\) are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of \(\Delta D\) and \(\Delta\mathsf{V}\), independent of the size of the database \(D\). We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.

KW - Data

KW - Data dependencies

KW - General

KW - Miscellaneous

UR - http://www.scopus.com/inward/record.url?scp=84902202624&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84902202624&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2012.138

DO - 10.1109/TKDE.2012.138

M3 - Article

VL - 26

SP - 1367

EP - 1383

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 6

M1 - 6243140

ER -