Data fusion: Resolving conflicts from multiple sources

Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceeding › Chapter

12 Citations (Scopus)

Abstract

Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the largest number of sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this chapter, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We describe a novel approach that considers copying between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are unlikely to be provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide copying between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also consider accuracy of data sources and similarity between values in fusion to further improve the results. We present a case study on real-world data showing that the described algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
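
The voting intuition in the abstract, and the way copying can defeat it, can be illustrated with a toy sketch. This is not the chapter's algorithm: the Bayesian copy detection, source accuracy, and value similarity are all omitted. The data, the function names, and the crude rule "sources with identical claim sets count as one vote" are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(claim_sets):
    """Return, for each item, the value asserted by the most claim sets."""
    tallies = defaultdict(Counter)
    for claim_set in claim_sets:
        for item, value in claim_set.items():
            tallies[item][value] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in tallies.items()}


def vote_ignoring_exact_copies(claims_by_source):
    """Crude stand-in for copy detection (an assumption, not the chapter's
    Bayesian analysis): sources whose claims are identical on every item
    are collapsed into a single voter before voting."""
    distinct = {frozenset(c.items()): c for c in claims_by_source.values()}
    return majority_vote(distinct.values())


# Hypothetical toy data. The assumed true values are x=A, y=B, z=C.
claims_by_source = {
    "S1": {"x": "A", "y": "B", "z": "C"},  # fully correct
    "S2": {"x": "A", "y": "B", "z": "D"},  # one independent error
    "S3": {"x": "E", "y": "F", "z": "C"},  # two errors
    "S4": {"x": "E", "y": "F", "z": "C"},  # assumed copy of S3
    "S5": {"x": "E", "y": "F", "z": "C"},  # assumed copy of S3
}

naive = majority_vote(claims_by_source.values())
copy_aware = vote_ignoring_exact_copies(claims_by_source)

print(naive)       # the copied false values E and F win on x and y
print(copy_aware)  # once copies are collapsed, A and B win on x and y
```

Exact-duplicate collapsing fails as soon as a copier changes even one value, which is one reason a probabilistic treatment of partial overlap, as the abstract describes, is needed in practice.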

Original language: English
Title of host publication: Handbook of Data Quality
Subtitle of host publication: Research and Practice
Publisher: Springer Berlin Heidelberg
Pages: 293-318
Number of pages: 26
ISBN (Electronic): 9783642362576
ISBN (Print): 9783642362569
DOIs: https://doi.org/10.1007/978-3-642-36257-6_13
Publication status: Published - 1 Jan 2013

Fingerprint

Copying
Data fusion
Information management
Industry

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Dong, X. L., Berti-Equille, L., & Srivastava, D. (2013). Data fusion: Resolving conflicts from multiple sources. In Handbook of Data Quality: Research and Practice (pp. 293-318). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_13

Data fusion: Resolving conflicts from multiple sources. / Dong, Xin Luna; Berti-Equille, Laure; Srivastava, Divesh.

Handbook of Data Quality: Research and Practice. Springer Berlin Heidelberg, 2013. p. 293-318.

Dong, XL, Berti-Equille, L & Srivastava, D 2013, Data fusion: Resolving conflicts from multiple sources. in Handbook of Data Quality: Research and Practice. Springer Berlin Heidelberg, pp. 293-318. https://doi.org/10.1007/978-3-642-36257-6_13
Dong XL, Berti-Equille L, Srivastava D. Data fusion: Resolving conflicts from multiple sources. In Handbook of Data Quality: Research and Practice. Springer Berlin Heidelberg. 2013. p. 293-318 https://doi.org/10.1007/978-3-642-36257-6_13
Dong, Xin Luna; Berti-Equille, Laure; Srivastava, Divesh. / Data fusion: Resolving conflicts from multiple sources. Handbook of Data Quality: Research and Practice. Springer Berlin Heidelberg, 2013. pp. 293-318
@inbook{56b633aa823f488088d7317598465759,
title = "Data fusion: Resolving conflicts from multiple sources",
abstract = "Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the largest number of sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this chapter, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We describe a novel approach that considers copying between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are unlikely to be provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide copying between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also consider accuracy of data sources and similarity between values in fusion to further improve the results. We present a case study on real-world data showing that the described algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.",
author = "Dong, {Xin Luna} and Laure Berti-Equille and Divesh Srivastava",
year = "2013",
month = "1",
day = "1",
doi = "10.1007/978-3-642-36257-6_13",
language = "English",
isbn = "9783642362569",
pages = "293--318",
booktitle = "Handbook of Data Quality",
publisher = "Springer Berlin Heidelberg",

}

TY - CHAP

T1 - Data fusion

T2 - Resolving conflicts from multiple sources

AU - Dong, Xin Luna

AU - Berti-Equille, Laure

AU - Srivastava, Divesh

PY - 2013/1/1

Y1 - 2013/1/1

N2 - Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the largest number of sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this chapter, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We describe a novel approach that considers copying between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are unlikely to be provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide copying between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also consider accuracy of data sources and similarity between values in fusion to further improve the results. We present a case study on real-world data showing that the described algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.

UR - http://www.scopus.com/inward/record.url?scp=84963741741&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84963741741&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-36257-6_13

DO - 10.1007/978-3-642-36257-6_13

M3 - Chapter

AN - SCOPUS:84963741741

SN - 9783642362569

SP - 293

EP - 318

BT - Handbook of Data Quality

PB - Springer Berlin Heidelberg

ER -