Automatic identification of informative sections of web pages

Sandip Debnath, Prasenjit Mitra, Nirmal Pal, C. Lee Giles

Research output: Contribution to journal › Article

68 Citations (Scopus)

Abstract

Web pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end-users search for the primary content and largely do not seek the noninformative content. A tool that helps an end-user or application search and process information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers trained with block features, respectively. Operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
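The first idea in the abstract, that primary content blocks are those that do not recur across many pages, can be illustrated with a minimal sketch. This is not the paper's ContentExtractor: it assumes pages have already been segmented into blocks, that blocks are plain strings compared by exact match, and that a block counts as noninformative when it appears on more than a chosen fraction of pages. The function names and the `max_fraction` threshold are hypothetical.

```python
from collections import Counter


def block_document_frequency(pages):
    """Count, for each distinct block, how many pages contain it."""
    df = Counter()
    for blocks in pages:
        df.update(set(blocks))  # count a block at most once per page
    return df


def primary_blocks(pages, max_fraction=0.5):
    """Keep only blocks that occur on at most max_fraction of the pages.

    Assumption: exact string equality stands in for block similarity.
    """
    df = block_document_frequency(pages)
    limit = max_fraction * len(pages)
    return [[b for b in blocks if df[b] <= limit] for blocks in pages]


# Hypothetical pages, each already split into text blocks.
pages = [
    ["Home | News | Sports", "Storm hits coast", "Copyright 2005"],
    ["Home | News | Sports", "Team wins final", "Copyright 2005"],
    ["Home | News | Sports", "Markets rally", "Copyright 2005"],
]

for kept in primary_blocks(pages):
    print(kept)
```

In this toy input the navigation bar and copyright notice appear on every page and are dropped, while each page's unique story block is kept. A real system would need a tolerant similarity measure, since repeated blocks on dynamically generated pages are rarely byte-identical.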

Original language: English
Pages (from-to): 1233-1246
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17
Issue number: 9
DOI: 10.1109/TKDE.2005.138
Publication status: Published - Sep 2005
Externally published: Yes

Fingerprint

Websites
HTML
Navigation
Classifiers

Keywords

  • Data mining
  • Feature extraction or construction
  • Informative block
  • Inverse block document frequency
  • Text mining
  • Web mining
  • Web page block

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Information Systems

Cite this

Automatic identification of informative sections of web pages. / Debnath, Sandip; Mitra, Prasenjit; Pal, Nirmal; Giles, C. Lee.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, 09.2005, p. 1233-1246.

Debnath, Sandip ; Mitra, Prasenjit ; Pal, Nirmal ; Giles, C. Lee. / Automatic identification of informative sections of web pages. In: IEEE Transactions on Knowledge and Data Engineering. 2005 ; Vol. 17, No. 9. pp. 1233-1246.
@article{e7b6948a832c4185976a9fe2fb27c821,
title = "Automatic identification of informative sections of web pages",
abstract = "Web pages-especially dynamically generated ones-contain several items that cannot be classified as the {"}primary content,{"} e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content, and largely do not seek the noninformative content. A tool that assists an end-user or application to search and process information from Web pages automatically, must separate the {"}primary content sections{"} from the other content sections. We call these sections as {"}Web page blocks{"} or just {"}blocks.{"} First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms, ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers, trained with block-features, respectively. While operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.",
keywords = "Data mining, Feature extraction or construction, Informative block, Inverse block document frequency, Text mining, Web mining, Web page block",
author = "Sandip Debnath and Prasenjit Mitra and Nirmal Pal and Giles, {C. Lee}",
year = "2005",
month = "9",
doi = "10.1109/TKDE.2005.138",
language = "English",
volume = "17",
pages = "1233--1246",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "9",

}

TY - JOUR

T1 - Automatic identification of informative sections of web pages

AU - Debnath, Sandip

AU - Mitra, Prasenjit

AU - Pal, Nirmal

AU - Giles, C. Lee

PY - 2005/9

Y1 - 2005/9

N2 - Web pages-especially dynamically generated ones-contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content, and largely do not seek the noninformative content. A tool that assists an end-user or application to search and process information from Web pages automatically, must separate the "primary content sections" from the other content sections. We call these sections as "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms, ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers, trained with block-features, respectively. While operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.

AB - Web pages-especially dynamically generated ones-contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content, and largely do not seek the noninformative content. A tool that assists an end-user or application to search and process information from Web pages automatically, must separate the "primary content sections" from the other content sections. We call these sections as "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms, ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers, trained with block-features, respectively. While operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.

KW - Data mining

KW - Feature extraction or construction

KW - Informative block

KW - Inverse block document frequency

KW - Text mining

KW - Web mining

KW - Web page block

UR - http://www.scopus.com/inward/record.url?scp=33947621611&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33947621611&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2005.138

DO - 10.1109/TKDE.2005.138

M3 - Article

AN - SCOPUS:33947621611

VL - 17

SP - 1233

EP - 1246

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 9

ER -