Integrating conflicting data: The role of source dependence

Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceeding > Chapter

261 Citations (Scopus)

Abstract

Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values and different sources can often provide conflicting values. To present quality data to users, it is critical that data integration systems can resolve conflicts and discover true values. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the majority of the sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this paper, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We present a novel approach that considers dependence between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are rarely provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide dependence between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also extend our model by considering accuracy of data sources and similarity between values. Our experiments on synthetic data as well as real-world data show that our algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
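The effect of copying on majority voting, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the paper infers dependence between sources with Bayesian analysis and iterates, whereas this sketch assumes the copy relationships are already known. The toy data, the `copies_from` map, the function names, and the `copier_weight` parameter are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data (not from the paper): claims[object][source] = value.
# Assume S1 and S2 are independent and correct, S3 is often wrong,
# and S4 and S5 copy from S3.
claims = {
    "obj1": {"S1": "A", "S2": "A", "S3": "B", "S4": "B", "S5": "B"},
    "obj2": {"S1": "X", "S2": "X", "S3": "Z", "S4": "Z", "S5": "Z"},
    "obj3": {"S1": "P", "S2": "P", "S3": "P", "S4": "P", "S5": "P"},
}

# Assumption: the copy relation is given. The paper detects it instead.
copies_from = {"S4": "S3", "S5": "S3"}


def naive_vote(votes):
    """Return the value provided by the most sources."""
    return Counter(votes.values()).most_common(1)[0][0]


def dependence_aware_vote(votes, copies_from, copier_weight=0.0):
    """Vote, but down-weight a source that repeats the value of its original."""
    tally = Counter()
    for source, value in votes.items():
        original = copies_from.get(source)
        copied = original is not None and votes.get(original) == value
        tally[value] += copier_weight if copied else 1.0
    return max(tally, key=tally.get)


naive = {obj: naive_vote(votes) for obj, votes in claims.items()}
aware = {obj: dependence_aware_vote(votes, copies_from)
         for obj, votes in claims.items()}

print(naive)  # {'obj1': 'B', 'obj2': 'Z', 'obj3': 'P'}
print(aware)  # {'obj1': 'A', 'obj2': 'X', 'obj3': 'P'}
```

In this toy setting the naive vote picks the copied false values for obj1 and obj2, while discounting the copiers restores the values given by the independent sources. Setting `copier_weight` to 0.0 ignores copied votes entirely; a value between 0 and 1 would be a softer discount.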

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 550-561
Number of pages: 12
Volume: 2
Edition: 1
Publication status: Published - 2009
Externally published: Yes

Fingerprint

Copying
Data integration
Information management
Industry
Experiments

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Dong, X. L., Berti-Equille, L., & Srivastava, D. (2009). Integrating conflicting data: The role of source dependence. In Proceedings of the VLDB Endowment (1 ed., Vol. 2, pp. 550-561).

@inbook{954aed0f07a846b79cf6b570b6fd1156,
title = "Integrating conflicting data: The role of source dependence",
author = "Dong, {Xin Luna} and Laure Berti-Equille and Divesh Srivastava",
year = "2009",
language = "English",
volume = "2",
pages = "550--561",
booktitle = "Proceedings of the VLDB Endowment",
edition = "1",

}

TY - CHAP

T1 - Integrating conflicting data

T2 - The role of source dependence

AU - Dong, Xin Luna

AU - Berti-Equille, Laure

AU - Srivastava, Divesh

PY - 2009

Y1 - 2009

UR - http://www.scopus.com/inward/record.url?scp=77954322933&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77954322933&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:77954322933

VL - 2

SP - 550

EP - 561

BT - Proceedings of the VLDB Endowment

ER -