Integrating XML data sources using approximate joins

Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, Ting Yu

Research output: Contribution to journal (Article)

19 Citations (Scopus)

Abstract

XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well-defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify such sets. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well-known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
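
The role of a reference set in a metric-space join can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes string edit distance as a stand-in for tree edit distance (both are metrics), picks the reference set by hand rather than by sampling, and uses hypothetical names (`edit_distance`, `project`, `approximate_join`, `tau`). Each item is projected to its vector of distances to the reference elements, and the triangle inequality then gives a cheap lower bound on the true distance, so pairs whose bound already exceeds the threshold are skipped without an exact comparison.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a metric, used here as a stand-in for tree edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def project(items, refs, dist):
    """Map each item to its vector of distances to the reference set."""
    return [[dist(x, r) for r in refs] for x in items]


def approximate_join(left, right, refs, tau, dist=edit_distance):
    """Return (matching pairs, number of candidate pairs verified exactly)."""
    lv = project(left, refs, dist)
    rv = project(right, refs, dist)
    pairs, verified = [], 0
    for (i, x), (j, y) in product(enumerate(left), enumerate(right)):
        # Triangle inequality: |d(x, r) - d(y, r)| <= d(x, y) for every reference r,
        # so the largest such gap is a lower bound on d(x, y).
        lower = max((abs(a - b) for a, b in zip(lv[i], rv[j])), default=0)
        if lower > tau:
            continue  # safely pruned without computing d(x, y)
        verified += 1
        if dist(x, y) <= tau:
            pairs.append((x, y))
    return pairs, verified


# Toy data: path-like strings standing in for XML documents (illustrative only).
left = ["book/title/author", "article/title"]
right = ["book/title/authors", "thesis/advisor/school/year"]
refs = ["book/title/author"]  # hand-picked reference set
pairs, verified = approximate_join(left, right, refs, tau=2)
```

Because the filter relies only on the triangle inequality, it never discards a true match; how many pairs it prunes depends on how well the reference set spreads the data, which is the property the abstract describes as a good choice of reference set.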

Original language: English
Pages (from-to): 161-207
Number of pages: 47
Journal: ACM Transactions on Database Systems
Volume: 31
Issue number: 1
DOI: 10.1145/1132863.1132868
Publication status: Published - 26 Jun 2006
Externally published: Yes

Keywords

  • Approximate joins
  • Data integration
  • Joins
  • Tree edit distance
  • XML

ASJC Scopus subject areas

  • Information Systems
  • Computer Graphics and Computer-Aided Design
  • Software

Cite this

Integrating XML data sources using approximate joins. / Guha, Sudipto; Jagadish, H. V.; Koudas, Nick; Srivastava, Divesh; Yu, Ting.

In: ACM Transactions on Database Systems, Vol. 31, No. 1, 26.06.2006, p. 161-207.

@article{6b9d503e3e8e41fa80839624864ce8bf,
title = "Integrating XML data sources using approximate joins",
keywords = "Approximate joins, Data integration, Joins, Tree edit distance, XML",
author = "Sudipto Guha and Jagadish, {H. V.} and Nick Koudas and Divesh Srivastava and Ting Yu",
year = "2006",
month = "6",
day = "26",
doi = "10.1145/1132863.1132868",
language = "English",
volume = "31",
pages = "161--207",
journal = "ACM Transactions on Database Systems",
issn = "0362-5915",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Integrating XML data sources using approximate joins

AU - Guha, Sudipto

AU - Jagadish, H. V.

AU - Koudas, Nick

AU - Srivastava, Divesh

AU - Yu, Ting

PY - 2006/6/26

Y1 - 2006/6/26

KW - Approximate joins

KW - Data integration

KW - Joins

KW - Tree edit distance

KW - XML

UR - http://www.scopus.com/inward/record.url?scp=33745218927&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33745218927&partnerID=8YFLogxK

U2 - 10.1145/1132863.1132868

DO - 10.1145/1132863.1132868

M3 - Article

AN - SCOPUS:33745218927

VL - 31

SP - 161

EP - 207

JO - ACM Transactions on Database Systems

JF - ACM Transactions on Database Systems

SN - 0362-5915

IS - 1

ER -