Automatic extraction of textual elements from news web pages

Hossam Ibrahim, Kareem Darwish, Abdel Rahim Abdel-Sabor

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a webpage without relying on the Document Object Model to which many content authors fail to adhere. The classifier uses a set of features which rely on the length of text, the percentage of hypertext, etc. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in Alzoa.com, which is the largest Arabic news aggregator on the web.
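The abstract names two of the classifier's features: the length of a text segment and its percentage of hypertext. The sketch below shows one way such features could be computed. It is an illustration under stated assumptions, not the authors' implementation. The block-boundary tag set, the names `BlockFeatureExtractor` and `extract_features`, and the definition of the link ratio (non-whitespace characters inside `<a>` elements divided by all non-whitespace characters in the block) are all hypothetical. It uses Python's event-based `html.parser` tokenizer and builds no DOM tree. The SVM training step is omitted; the resulting feature tuples would be passed to an SVM implementation of one's choice.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries. This set is an illustrative assumption,
# not taken from the paper.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "title", "br"}
SKIP_TAGS = {"script", "style"}


class BlockFeatureExtractor(HTMLParser):
    """Splits a page into text blocks and records simple per-block features."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, length, link_ratio)
        self._parts = []       # raw text pieces of the block being built
        self._link_chars = 0   # non-whitespace chars seen inside <a>
        self._link_depth = 0   # > 0 while inside an <a> element
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def flush(self):
        """Close the current block and store its features if it has text."""
        text = " ".join("".join(self._parts).split())
        if text:
            visible = len(text.replace(" ", ""))
            self.blocks.append((text, len(text), self._link_chars / visible))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        self._parts.append(data)
        if self._link_depth:
            self._link_chars += len("".join(data.split()))


def extract_features(html):
    """Return a list of (text, length, link_ratio) tuples, one per block."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing block
    return parser.blocks
```

On a typical news page, a navigation bar would be expected to yield short blocks with a link ratio near 1, while story paragraphs would yield long blocks with a ratio near 0. That contrast is the kind of signal a classifier could learn from.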

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
Publisher: European Language Resources Association (ELRA)
Pages: 1600-1603
Number of pages: 4
ISBN (Electronic): 2951740840, 9782951740846
Publication status: Published - 1 Jan 2008
Externally published: Yes
Event: 6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 28 May 2008 to 30 May 2008

Other

Other: 6th International Conference on Language Resources and Evaluation, LREC 2008
Country: Morocco
City: Marrakech
Period: 28/5/08 to 30/5/08

Fingerprint

news
hypertext
World Wide Web
News
Classifier
learning
Length
Hypertext
News Stories
Support Vector Machine
Machine Learning

ASJC Scopus subject areas

  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics
  • Education

Cite this

Ibrahim, H., Darwish, K., & Abdel-Sabor, A. R. (2008). Automatic extraction of textual elements from news web pages. In Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008 (pp. 1600-1603). European Language Resources Association (ELRA).

@inproceedings{99460847ac0c4f368f06fad2b4f9594e,
title = "Automatic extraction of textual elements from news web pages",
abstract = "In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a webpage without relying on the Document Object Model to which many content authors fail to adhere. The classifier uses a set of features which rely on the length of text, the percentage of hypertext, etc. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in Alzoa.com, which is the largest Arabic news aggregator on the web.",
author = "Hossam Ibrahim and Kareem Darwish and Abdel-Sabor, {Abdel Rahim}",
year = "2008",
month = "1",
day = "1",
language = "English",
pages = "1600--1603",
booktitle = "Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008",
publisher = "European Language Resources Association (ELRA)",
}

TY - GEN

T1 - Automatic extraction of textual elements from news web pages

AU - Ibrahim, Hossam

AU - Darwish, Kareem

AU - Abdel-Sabor, Abdel Rahim

PY - 2008/1/1

Y1 - 2008/1/1

AB - In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a webpage without relying on the Document Object Model to which many content authors fail to adhere. The classifier uses a set of features which rely on the length of text, the percentage of hypertext, etc. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in Alzoa.com, which is the largest Arabic news aggregator on the web.

UR - http://www.scopus.com/inward/record.url?scp=85017520837&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85017520837&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85017520837

SP - 1600

EP - 1603

BT - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008

PB - European Language Resources Association (ELRA)

ER -