Automatic extraction of informative blocks from webpages

Sandip Debnath, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

46 Citations (Scopus)

Abstract

Search engines crawl and index webpages depending upon their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as the "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must separate the "primary content blocks" from the other blocks. In this paper, two new algorithms, ContentExtractor and FeatureExtractor, are proposed. The algorithms identify primary content blocks by i) looking for blocks that do not occur a large number of times across webpages and ii) looking for blocks with desired features, respectively. They identify the primary content blocks with high precision and recall, reduce the storage requirements of search engines, and yield smaller indexes, faster search times, and better user satisfaction. Operating on several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
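The abstract describes both algorithms only at a high level. As a rough illustration of ContentExtractor's central idea, the Python sketch below flags a block as non-informative when a near-identical block recurs across many pages of the same site; the block-segmentation heuristic, the fingerprinting scheme, and the frequency cutoff here are all our own assumptions, not the authors' actual implementation.

```python
import hashlib
import re
from collections import Counter

def split_into_blocks(html: str) -> list[str]:
    """Crude block segmentation: split on common container tags.
    (A stand-in assumption; the paper partitions pages by HTML
    tag structure, not by this regex.)"""
    parts = re.split(r"</?(?:table|tr|td|div|p)[^>]*>", html, flags=re.I)
    return [p.strip() for p in parts if p.strip()]

def fingerprint(block: str) -> str:
    """Collapse whitespace so near-identical blocks hash alike."""
    return hashlib.md5(" ".join(block.split()).encode()).hexdigest()

def extract_primary_blocks(pages: list[str], max_doc_frac: float = 0.3):
    """Keep blocks whose fingerprint occurs in at most max_doc_frac
    of the pages (the 0.3 cutoff is hypothetical)."""
    page_blocks = [split_into_blocks(p) for p in pages]
    doc_freq = Counter()  # number of pages containing each fingerprint
    for blocks in page_blocks:
        doc_freq.update({fingerprint(b) for b in blocks})
    cutoff = max_doc_frac * len(pages)
    # Blocks repeated across many pages (navigation bars, ads,
    # copyright footers) are dropped; the rest is kept as primary content.
    return [[b for b in blocks if doc_freq[fingerprint(b)] <= cutoff]
            for blocks in page_blocks]
```

FeatureExtractor, by contrast, scores blocks against desired features (for instance, how much running text a block contains) rather than cross-page frequency; that variant is not sketched here.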

Original language: English
Title of host publication: Proceedings of the ACM Symposium on Applied Computing
Pages: 1722-1726
Number of pages: 5
Volume: 2
DOIs: https://doi.org/10.1145/1066677.1067065
Publication status: Published - 2005
Externally published: Yes
Event: 20th Annual ACM Symposium on Applied Computing - Santa Fe, NM
Duration: 13 Mar 2005 → 17 Mar 2005

Fingerprint

  • Search engines
  • Websites
  • Navigation
  • Entropy

Keywords

  • Data Mining
  • Electronic Publishing
  • Information Systems Applications

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Debnath, S., Mitra, P., & Lee Giles, C. (2005). Automatic extraction of informative blocks from webpages. In Proceedings of the ACM Symposium on Applied Computing (Vol. 2, pp. 1722-1726). https://doi.org/10.1145/1066677.1067065
