Exploiting information redundancy to wring out structured data from the web

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.

Original language: English
Title of host publication: Proceedings of the 19th International Conference on World Wide Web, WWW '10
Pages: 1063-1064
Number of pages: 2
DOIs: https://doi.org/10.1145/1772690.1772805
Publication status: Published - 20 Jul 2010
Externally published: Yes
Event: 19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States
Duration: 26 Apr 2010 to 30 Apr 2010

Other

Other: 19th International World Wide Web Conference, WWW2010
Country: United States
City: Raleigh, NC
Period: 26/4/10 to 30/4/10

Fingerprint

World Wide Web
Redundancy
Data integration
Websites
Experiments

Keywords

  • data extraction
  • data integration
  • wrapper generation

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications

Cite this

Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., & Papotti, P. (2010). Exploiting information redundancy to wring out structured data from the web. In Proceedings of the 19th International Conference on World Wide Web, WWW '10 (pp. 1063-1064) https://doi.org/10.1145/1772690.1772805

@inproceedings{7e588ee4df044268a8a069d5dd513aa6,
title = "Exploiting information redundancy to wring out structured data from the web",
abstract = "A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.",
keywords = "data extraction, data integration, wrapper generation",
author = "Lorenzo Blanco and Mirko Bronzi and Valter Crescenzi and Paolo Merialdo and Paolo Papotti",
year = "2010",
month = "7",
day = "20",
doi = "10.1145/1772690.1772805",
language = "English",
isbn = "9781605587998",
pages = "1063--1064",
booktitle = "Proceedings of the 19th International Conference on World Wide Web, WWW '10",

}

TY  - GEN
T1  - Exploiting information redundancy to wring out structured data from the web
AU  - Blanco, Lorenzo
AU  - Bronzi, Mirko
AU  - Crescenzi, Valter
AU  - Merialdo, Paolo
AU  - Papotti, Paolo
PY  - 2010/7/20
Y1  - 2010/7/20
AB  - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.
KW  - data extraction
KW  - data integration
KW  - wrapper generation
UR  - http://www.scopus.com/inward/record.url?scp=77954591954&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=77954591954&partnerID=8YFLogxK
U2  - 10.1145/1772690.1772805
DO  - 10.1145/1772690.1772805
M3  - Conference contribution
AN  - SCOPUS:77954591954
SN  - 9781605587998
SP  - 1063
EP  - 1064
BT  - Proceedings of the 19th International Conference on World Wide Web, WWW '10
ER  -