Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems

Hatem A. Mahmoud, Ashraf Aboulnaga

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

26 Citations (Scopus)

Abstract

A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema, and initial mappings from source schemas to the mediated schema), and then refining the system as it gets used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention and based only on the names of attributes in the schemas. Our clustering approach deals with uncertainty in assigning schemas to domains using a probabilistic model. We also propose a query classifier that determines, for a given keyword query, the most relevant domains to this query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Pages: 411-422
Number of pages: 12
DOI: https://doi.org/10.1145/1807167.1807213
Publication status: Published - 23 Jul 2010
Externally published: Yes
Event: 2010 International Conference on Management of Data, SIGMOD '10 - Indianapolis, IN, United States
Duration: 6 Jun 2010 → 11 Jun 2010

Keywords

  • classification
  • clustering
  • data integration

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Mahmoud, H. A., & Aboulnaga, A. (2010). Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 411-422) https://doi.org/10.1145/1807167.1807213

Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. / Mahmoud, Hatem A.; Aboulnaga, Ashraf. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2010. p. 411-422.

Mahmoud, HA & Aboulnaga, A 2010, Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. in Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 411-422, 2010 International Conference on Management of Data, SIGMOD '10, Indianapolis, IN, United States, 6/6/10. https://doi.org/10.1145/1807167.1807213
Mahmoud HA, Aboulnaga A. Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2010. p. 411-422 https://doi.org/10.1145/1807167.1807213
Mahmoud, Hatem A. ; Aboulnaga, Ashraf. / Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2010. pp. 411-422
@inproceedings{840c1f496c364e23991d8c3cf8f3e12f,
title = "Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems",
abstract = "A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema, and initial mappings from source schemas to the mediated schema), and then refining the system as it gets used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention and based only on the names of attributes in the schemas. Our clustering approach deals with uncertainty in assigning schemas to domains using a probabilistic model. We also propose a query classifier that determines, for a given keyword query, the most relevant domains to this query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.",
keywords = "classification, clustering, data integration",
author = "Mahmoud, Hatem A. and Aboulnaga, Ashraf",
year = "2010",
month = "7",
day = "23",
doi = "10.1145/1807167.1807213",
language = "English",
isbn = "9781450300322",
pages = "411--422",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
}

TY - GEN
T1 - Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems
AU - Mahmoud, Hatem A.
AU - Aboulnaga, Ashraf
PY - 2010/7/23
Y1 - 2010/7/23
AB - A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema, and initial mappings from source schemas to the mediated schema), and then refining the system as it gets used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention and based only on the names of attributes in the schemas. Our clustering approach deals with uncertainty in assigning schemas to domains using a probabilistic model. We also propose a query classifier that determines, for a given keyword query, the most relevant domains to this query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.
KW - classification
KW - clustering
KW - data integration
UR - http://www.scopus.com/inward/record.url?scp=77954705062&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954705062&partnerID=8YFLogxK
U2 - 10.1145/1807167.1807213
DO - 10.1145/1807167.1807213
M3 - Conference contribution
SN - 9781450300322
SP - 411
EP - 422
BT - Proceedings of the ACM SIGMOD International Conference on Management of Data
ER -