Identifying content blocks from Web documents

Sandip Debnath, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

33 Citations (Scopus)

Abstract

Intelligent information processing systems, such as digital libraries and search engines, index web pages according to their informative content. However, web pages also contain non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is important to separate the informative "primary content blocks" from these non-informative blocks. This paper proposes two algorithms, FeatureExtractor and K-FeatureExtractor, that identify the "primary content blocks" based on their features. Neither algorithm requires supervised learning, yet both identify the "primary content blocks" with high precision and recall. On several thousand web pages obtained from 15 different websites, the algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
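
The abstract reports results in terms of precision and recall. As a minimal sketch of how those two measures could be computed for block identification, assuming each block on a page has an identifier and a hand-labelled ground truth is available: the function name `precision_recall`, the block identifiers, and the example data below are hypothetical and are not taken from the paper, and the sketch does not implement FeatureExtractor or K-FeatureExtractor.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of block identifiers.

    predicted: blocks an algorithm marked as primary content (hypothetical input)
    actual:    blocks a human labelled as primary content (hypothetical input)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page: blocks b2 and b3 are the hand-labelled primary content.
labelled_primary = {"b2", "b3"}
# Hypothetical output of some block-identification algorithm.
identified_primary = {"b2", "b3", "b5"}

precision, recall = precision_recall(identified_primary, labelled_primary)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=1.00
```

In this made-up case the algorithm finds both primary blocks (recall 1.00) but also marks one non-informative block as primary, so precision drops to 2/3.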

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 285-293
Number of pages: 9
Volume: 3488 LNAI
Publication status: Published - 2005
Externally published: Yes
Event: 15th International Symposium on Methodologies for Intelligent Systems, ISMIS 2005 - Saratoga Springs, NY
Duration: 25 May 2005 to 28 May 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3488 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th International Symposium on Methodologies for Intelligent Systems, ISMIS 2005
City: Saratoga Springs, NY
Period: 25/5/05 to 28/5/05

Fingerprint

Websites
Digital libraries
Search engines
Supervised learning
Entropy
Automatic Data Processing
Information Systems
Navigation
Information Processing
Learning

Keywords

  • Data Mining
  • Electronic Publishing
  • Information Systems Applications

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Debnath, S., Mitra, P., & Giles, C. L. (2005). Identifying content blocks from Web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3488 LNAI, pp. 285-293).

Identifying content blocks from Web documents. / Debnath, Sandip; Mitra, Prasenjit; Giles, C. Lee.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 3488 LNAI. 2005. p. 285-293.

Debnath, S, Mitra, P & Giles, CL 2005, Identifying content blocks from Web documents. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 3488 LNAI, pp. 285-293, 15th International Symposium on Methodologies for Intelligent Systems, ISMIS 2005, Saratoga Springs, NY, 25/5/05.
Debnath S, Mitra P, Giles CL. Identifying content blocks from Web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 3488 LNAI. 2005. p. 285-293.
Debnath, Sandip ; Mitra, Prasenjit ; Giles, C. Lee / Identifying content blocks from Web documents. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 3488 LNAI. 2005. pp. 285-293.
@inproceedings{fe975b48b171494d9d2c7c774279160f,
title = "Identifying content blocks from Web documents",
abstract = "Intelligent information processing systems, such as digital libraries or search engines index web-pages according to their informative content. However, web-pages contain several non-informative contents, e.g., navigation sidebars, advertisements, copyright notices, etc. It is very important to separate the informative {"}primary content blocks{"} from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor are proposed to identify the {"}primary content blocks{"} based on their features. None of these algorithms require any supervised learning, but still can identify the {"}primary content blocks{"} with high precision and recall. While operating on several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.",
keywords = "Data Mining, Electronic Publishing, Information Systems Applications",
author = "Sandip Debnath and Prasenjit Mitra and C. Lee Giles",
year = "2005",
language = "English",
isbn = "3540258787",
volume = "3488 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "285--293",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Identifying content blocks from Web documents

AU - Debnath, Sandip

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2005

Y1 - 2005

AB - Intelligent information processing systems, such as digital libraries or search engines index web-pages according to their informative content. However, web-pages contain several non-informative contents, e.g., navigation sidebars, advertisements, copyright notices, etc. It is very important to separate the informative "primary content blocks" from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor are proposed to identify the "primary content blocks" based on their features. None of these algorithms require any supervised learning, but still can identify the "primary content blocks" with high precision and recall. While operating on several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.

KW - Data Mining

KW - Electronic Publishing

KW - Information Systems Applications

UR - http://www.scopus.com/inward/record.url?scp=26944496810&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=26944496810&partnerID=8YFLogxK

M3 - Conference contribution

SN - 3540258787

SN - 9783540258780

VL - 3488 LNAI

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 285

EP - 293

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -