Approximate XML joins

Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, Ting Yu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

107 Citations (Scopus)

Abstract

XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, approximate match in structure, in addition to content, has to be folded into the join operation. We quantify approximate match in structure and content using well-defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling-based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Editors: M.F.B. Moon, A. Ailamaki
Pages: 287-298
Number of pages: 12
Publication status: Published - 2002
Externally published: Yes
Event: ACM SIGMOD 2002 Proceedings of the ACM SIGMOD International Conference on Management of Data - Madison, WI, United States
Duration: 3 Jun 2002 to 6 Jun 2002

Other

Other: ACM SIGMOD 2002 Proceedings of the ACM SIGMOD International Conference on Management of Data
Country: United States
City: Madison, WI
Period: 3/6/02 to 6/6/02

Fingerprint

XML
Interchanges
Sampling

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Guha, S., Jagadish, H. V., Koudas, N., Srivastava, D., & Yu, T. (2002). Approximate XML joins. In M. F. B. Moon, & A. Ailamaki (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 287-298)

Approximate XML joins. / Guha, Sudipto; Jagadish, H. V.; Koudas, Nick; Srivastava, Divesh; Yu, Ting.

Proceedings of the ACM SIGMOD International Conference on Management of Data. ed. / M.F.B. Moon; A. Ailamaki. 2002. p. 287-298.

Guha, S, Jagadish, HV, Koudas, N, Srivastava, D & Yu, T 2002, Approximate XML joins. in MFB Moon & A Ailamaki (eds), Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 287-298, ACM SIGMOD 2002 Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, WI, United States, 3/6/02.
Guha S, Jagadish HV, Koudas N, Srivastava D, Yu T. Approximate XML joins. In Moon MFB, Ailamaki A, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data. 2002. p. 287-298
Guha, Sudipto ; Jagadish, H. V. ; Koudas, Nick ; Srivastava, Divesh ; Yu, Ting. / Approximate XML joins. Proceedings of the ACM SIGMOD International Conference on Management of Data. editor / M.F.B. Moon ; A. Ailamaki. 2002. pp. 287-298
@inproceedings{58c95db363424fd599cc4cd481962fb3,
title = "Approximate XML joins",
abstract = "XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, approximate match in structure, in addition to content, has to be folded into the join operation. We quantify approximate match in structure and content using well-defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling-based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.",
author = "Sudipto Guha and Jagadish, {H. V.} and Nick Koudas and Divesh Srivastava and Ting Yu",
year = "2002",
language = "English",
pages = "287--298",
editor = "M.F.B. Moon and A. Ailamaki",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - Approximate XML joins

AU - Guha, Sudipto

AU - Jagadish, H. V.

AU - Koudas, Nick

AU - Srivastava, Divesh

AU - Yu, Ting

PY - 2002

Y1 - 2002

AB - XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, approximate match in structure, in addition to content, has to be folded into the join operation. We quantify approximate match in structure and content using well-defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling-based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.

UR - http://www.scopus.com/inward/record.url?scp=0036361233&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036361233&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0036361233

SP - 287

EP - 298

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

A2 - Moon, M.F.B.

A2 - Ailamaki, A.

ER -