Contextual data extraction and instance-based integration

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, which consists of generating the extraction rules needed to obtain data from web pages, and the data integration problem, which consists of integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional "waterfall approach", in which data are wholly extracted before integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model to real-world websites.

Original language: English
Title of host publication: CEUR Workshop Proceedings
Volume: 880
Publication status: Published - 2011
Externally published: Yes
Event: 1st International Workshop on Searching and Integrating New Web Data Sources - Very Large Data Search, VLDS 2011 - Seattle, WA, United States
Duration: 2 Sep 2011 to 2 Sep 2011

Other

Other: 1st International Workshop on Searching and Integrating New Web Data Sources - Very Large Data Search, VLDS 2011
Country: United States
City: Seattle, WA
Period: 2/9/11 to 2/9/11

Fingerprint

Websites
Data integration
Polynomials

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Blanco, L., Crescenzi, V., Merialdo, P., & Papotti, P. (2011). Contextual data extraction and instance-based integration. In CEUR Workshop Proceedings (Vol. 880).

@inproceedings{72b65ed2939645fc925e3aaf8cbae42d,
title = "Contextual data extraction and instance-based integration",
abstract = "We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, which consists of generating the extraction rules needed to obtain data from web pages, and the data integration problem, which consists of integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional {"}waterfall approach{"}, in which data are wholly extracted before integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model to real-world websites.",
author = "Lorenzo Blanco and Valter Crescenzi and Paolo Merialdo and Paolo Papotti",
year = "2011",
language = "English",
volume = "880",
booktitle = "CEUR Workshop Proceedings",
}

TY - GEN

T1 - Contextual data extraction and instance-based integration

AU - Blanco, Lorenzo

AU - Crescenzi, Valter

AU - Merialdo, Paolo

AU - Papotti, Paolo

PY - 2011

Y1 - 2011

N2 - We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, which consists of generating the extraction rules needed to obtain data from web pages, and the data integration problem, which consists of integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional "waterfall approach", in which data are wholly extracted before integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model to real-world websites.

UR - http://www.scopus.com/inward/record.url?scp=84891944556&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84891944556&partnerID=8YFLogxK

M3 - Conference contribution

VL - 880

BT - CEUR Workshop Proceedings

ER -