Truth discovery and copying detection in a dynamic world

Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceeding › Chapter

159 Citations (Scopus)

Abstract

Modern information management applications often require integrating data from a variety of data sources, some of which may copy or buy data from other sources. When these data sources model a dynamically changing world (e.g., people's contact information changes over time, restaurants open and go out of business), sources often provide out-of-date data. Errors can also creep into data when sources are updated often. Given out-of-date and erroneous data provided by different, possibly dependent, sources, it is challenging for data integration systems to provide the true values. Straightforward ways to resolve such inconsistencies (e.g., voting) may lead to noisy results, often with detrimental consequences. In this paper, we study the problem of finding true values and determining the copying relationship between sources, when the update history of the sources is known. We model the quality of sources over time by their coverage, exactness and freshness. Based on these measures, we conduct a probabilistic analysis. First, we develop a Hidden Markov Model that decides whether a source is a copier of another source and identifies the specific moments at which it copies. Second, we develop a Bayesian model that aggregates information from the sources to decide the true value for a data item, and the evolution of the true values over time. Experimental results on both real-world and synthetic data show high accuracy and scalability of our techniques.
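The abstract's remark that voting can mislead when sources copy from one another can be illustrated with a small sketch. This is not the paper's Hidden Markov Model or Bayesian model. It is a toy majority vote, and it assumes the copying relationships are already known. The source names, the `claims` data, and the `copies_from` mapping are all hypothetical.

```python
from collections import Counter


def majority_vote(claims):
    """Return the value claimed by the largest number of sources."""
    counts = Counter(claims.values())
    return counts.most_common(1)[0][0]


def vote_ignoring_copiers(claims, copies_from):
    """Majority vote that counts only sources not listed as copiers.

    `copies_from` maps a copier to the source it copies (assumed known here;
    the paper infers such relationships from update histories instead).
    """
    independent = {s: v for s, v in claims.items() if s not in copies_from}
    return majority_vote(independent)


# Hypothetical scenario: a restaurant has closed, but one stale source
# still lists it as open, and two other sources copy that stale source.
claims = {
    "S1": "closed",
    "S2": "closed",
    "S3": "open",  # out-of-date
    "S4": "open",  # copied from S3
    "S5": "open",  # copied from S3
}
copies_from = {"S4": "S3", "S5": "S3"}

naive = majority_vote(claims)
adjusted = vote_ignoring_copiers(claims, copies_from)
print(naive, adjusted)
```

In this toy case the naive vote follows the three dependent sources, while the vote restricted to independent sources follows the two up-to-date ones. The sketch only shows why dependence matters. It does not model coverage, exactness, or freshness over time.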

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 562-573
Number of pages: 12
Volume: 2
Edition: 1
Publication status: Published - 2009
Externally published: Yes

Fingerprint

Copying
Data integration
Hidden Markov models
Information management
Scalability
Creep
Industry

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Dong, X. L., Berti-Equille, L., & Srivastava, D. (2009). Truth discovery and copying detection in a dynamic world. In Proceedings of the VLDB Endowment (1 ed., Vol. 2, pp. 562-573)

Truth discovery and copying detection in a dynamic world. / Dong, Xin Luna; Berti-Equille, Laure; Srivastava, Divesh.

Proceedings of the VLDB Endowment. Vol. 2. 1 ed. 2009. p. 562-573.

Dong, XL, Berti-Equille, L & Srivastava, D 2009, Truth discovery and copying detection in a dynamic world. in Proceedings of the VLDB Endowment. 1 edn, vol. 2, pp. 562-573.
Dong XL, Berti-Equille L, Srivastava D. Truth discovery and copying detection in a dynamic world. In Proceedings of the VLDB Endowment. 1 ed. Vol. 2. 2009. p. 562-573
Dong, Xin Luna ; Berti-Equille, Laure ; Srivastava, Divesh. / Truth discovery and copying detection in a dynamic world. Proceedings of the VLDB Endowment. Vol. 2 1. ed. 2009. pp. 562-573
@inbook{f705609b06ed44399798ab56f9ecf83b,
title = "Truth discovery and copying detection in a dynamic world",
abstract = "Modern information management applications often require integrating data from a variety of data sources, some of which may copy or buy data from other sources. When these data sources model a dynamically changing world (e.g., people's contact information changes over time, restaurants open and go out of business), sources often provide out-of-date data. Errors can also creep into data when sources are updated often. Given out-of-date and erroneous data provided by different, possibly dependent, sources, it is challenging for data integration systems to provide the true values. Straightforward ways to resolve such inconsistencies (e.g., voting) may lead to noisy results, often with detrimental consequences. In this paper, we study the problem of finding true values and determining the copying relationship between sources, when the update history of the sources is known. We model the quality of sources over time by their coverage, exactness and freshness. Based on these measures, we conduct a probabilistic analysis. First, we develop a Hidden Markov Model that decides whether a source is a copier of another source and identifies the specific moments at which it copies. Second, we develop a Bayesian model that aggregates information from the sources to decide the true value for a data item, and the evolution of the true values over time. Experimental results on both real-world and synthetic data show high accuracy and scalability of our techniques.",
author = "Dong, {Xin Luna} and Laure Berti-Equille and Divesh Srivastava",
year = "2009",
language = "English",
volume = "2",
pages = "562--573",
booktitle = "Proceedings of the VLDB Endowment",
edition = "1",

}

TY - CHAP

T1 - Truth discovery and copying detection in a dynamic world

AU - Dong, Xin Luna

AU - Berti-Equille, Laure

AU - Srivastava, Divesh

PY - 2009

Y1 - 2009

N2 - Modern information management applications often require integrating data from a variety of data sources, some of which may copy or buy data from other sources. When these data sources model a dynamically changing world (e.g., people's contact information changes over time, restaurants open and go out of business), sources often provide out-of-date data. Errors can also creep into data when sources are updated often. Given out-of-date and erroneous data provided by different, possibly dependent, sources, it is challenging for data integration systems to provide the true values. Straightforward ways to resolve such inconsistencies (e.g., voting) may lead to noisy results, often with detrimental consequences. In this paper, we study the problem of finding true values and determining the copying relationship between sources, when the update history of the sources is known. We model the quality of sources over time by their coverage, exactness and freshness. Based on these measures, we conduct a probabilistic analysis. First, we develop a Hidden Markov Model that decides whether a source is a copier of another source and identifies the specific moments at which it copies. Second, we develop a Bayesian model that aggregates information from the sources to decide the true value for a data item, and the evolution of the true values over time. Experimental results on both real-world and synthetic data show high accuracy and scalability of our techniques.

UR - http://www.scopus.com/inward/record.url?scp=77954323674&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77954323674&partnerID=8YFLogxK

M3 - Chapter

VL - 2

SP - 562

EP - 573

BT - Proceedings of the VLDB Endowment

ER -