Global detection of complex copying relationships between sources

Xin Luna Dong, Laure Berti-Equille, Yifan Hu, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceeding › Chapter

82 Citations (Scopus)

Abstract

Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent works have studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships. In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, global detection requires accurate decisions on copying direction; we significantly improve over previous techniques on this by considering various types of evidence for copying and correlation of copying on different data items. Experimental results on real-world data and synthetic data show high effectiveness and efficiency of our techniques.
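The three relationship types named in the abstract (direct copying, co-copying, and transitive copying) can be illustrated with a small sketch. This is not the paper's detection algorithm, which has to infer copying from observed data. The sketch assumes the true direct-copy edges are already known and only labels each source pair accordingly. The function names, the `copies_from` mapping, and the source names `S1` to `S4` are hypothetical, and "co-copying" is read loosely here as "sharing a common original source, directly or indirectly".

```python
from itertools import combinations


def reachable(graph, start):
    """Return every source reachable from `start` by following copy edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classify_pairs(copies_from):
    """Label each unordered source pair, given known direct-copy edges.

    `copies_from` maps a copier to the set of sources it copies from directly
    (an assumed, hand-written ground truth, not something detected from data).
    """
    sources = set(copies_from) | {s for v in copies_from.values() for s in v}
    ancestors = {s: reachable(copies_from, s) for s in sources}
    labels = {}
    for a, b in combinations(sorted(sources), 2):
        if b in copies_from.get(a, ()) or a in copies_from.get(b, ()):
            labels[(a, b)] = "direct"
        elif b in ancestors[a] or a in ancestors[b]:
            labels[(a, b)] = "transitive"
        elif ancestors[a] & ancestors[b]:
            labels[(a, b)] = "co-copying"
        else:
            labels[(a, b)] = "independent"
    return labels


# Hypothetical example: S2 and S3 both copy from S1, and S4 copies from S2.
example = {"S2": {"S1"}, "S3": {"S1"}, "S4": {"S2"}}
labels = classify_pairs(example)

# Keep only the pairs a global detector would be expected to report.
direct_only = sorted(pair for pair, label in labels.items() if label == "direct")
```

In this toy graph, a purely pairwise detector could flag S2 and S3 (which share data through S1) or S1 and S4 (which share data through S2) as copiers of each other; returning only `direct_only` corresponds to the abstract's goal of reporting source pairs with direct copying.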

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 1358-1369
Number of pages: 12
Volume: 3
Edition: 1
Publication status: Published - Sep 2010
Externally published: Yes

Fingerprint

Copying
Data integration

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Dong, X. L., Berti-Equille, L., Hu, Y., & Srivastava, D. (2010). Global detection of complex copying relationships between sources. In Proceedings of the VLDB Endowment (1 ed., Vol. 3, pp. 1358-1369)

Global detection of complex copying relationships between sources. / Dong, Xin Luna; Berti-Equille, Laure; Hu, Yifan; Srivastava, Divesh.

Proceedings of the VLDB Endowment. Vol. 3 1. ed. 2010. p. 1358-1369.

Dong, XL, Berti-Equille, L, Hu, Y & Srivastava, D 2010, Global detection of complex copying relationships between sources. in Proceedings of the VLDB Endowment. 1 edn, vol. 3, pp. 1358-1369.
Dong XL, Berti-Equille L, Hu Y, Srivastava D. Global detection of complex copying relationships between sources. In Proceedings of the VLDB Endowment. 1 ed. Vol. 3. 2010. p. 1358-1369
Dong, Xin Luna ; Berti-Equille, Laure ; Hu, Yifan ; Srivastava, Divesh. / Global detection of complex copying relationships between sources. Proceedings of the VLDB Endowment. Vol. 3 1. ed. 2010. pp. 1358-1369
@inbook{8e80995e03664159a4ec31c82524d786,
title = "Global detection of complex copying relationships between sources",
author = "Dong, {Xin Luna} and Laure Berti-Equille and Yifan Hu and Divesh Srivastava",
year = "2010",
month = "9",
language = "English",
volume = "3",
pages = "1358--1369",
booktitle = "Proceedings of the VLDB Endowment",
edition = "1",

}

TY - CHAP

T1 - Global detection of complex copying relationships between sources

AU - Dong, Xin Luna

AU - Berti-Equille, Laure

AU - Hu, Yifan

AU - Srivastava, Divesh

PY - 2010/9

Y1 - 2010/9

N2 - Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent works have studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships. In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, global detection requires accurate decisions on copying direction; we significantly improve over previous techniques on this by considering various types of evidence for copying and correlation of copying on different data items. Experimental results on real-world data and synthetic data show high effectiveness and efficiency of our techniques.

UR - http://www.scopus.com/inward/record.url?scp=84859258624&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859258624&partnerID=8YFLogxK

M3 - Chapter

VL - 3

SP - 1358

EP - 1369

BT - Proceedings of the VLDB Endowment

ER -