Automatic extraction of textual elements from news web pages

Hossam Ibrahim, Kareem Darwish, Abdel Rahim Abdel-Sabor

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

In this paper, we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique that uses a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a web page without relying on the Document Object Model, to which many content authors fail to adhere. The classifier uses a set of features that rely on the length of text, the percentage of hypertext, etc. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in Alzoa.com, which is the largest Arabic news aggregator on the web.
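To make the two features named in the abstract concrete (length of text and percentage of hypertext), here is a minimal sketch of how they might be computed per text block from tag events alone, without building a DOM tree. The block segmentation, the `BLOCK_TAGS` set, and the names `BlockFeatureExtractor` and `extract_features` are illustrative assumptions, not the authors' feature set or implementation. In a full pipeline, vectors like these would be labeled and passed to an SVM classifier.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries; this set is an illustrative assumption.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "title", "br"}
# Elements whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class BlockFeatureExtractor(HTMLParser):
    """Splits a page into text blocks using tag events only (no DOM tree)."""

    def __init__(self):
        super().__init__()
        self.blocks = []         # finished blocks, one dict per block
        self._parts = []         # raw text fragments of the current block
        self._total_chars = 0    # non-whitespace characters in current block
        self._link_chars = 0     # of those, characters inside <a> elements
        self._anchor_depth = 0   # > 0 while inside an <a> element
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def flush(self):
        """Close the current block and record its features if it has text."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append({
                "text": text,
                "length": len(text),
                "link_ratio": self._link_chars / self._total_chars,
            })
        self._parts = []
        self._total_chars = 0
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        visible = sum(1 for ch in data if not ch.isspace())
        self._total_chars += visible
        if self._anchor_depth:
            self._link_chars += visible


def extract_features(html):
    """Return one feature dict (text, length, link_ratio) per text block."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # record any trailing text after the last block tag
    return parser.blocks
```

Under these assumptions, a navigation bar tends to yield short blocks with a `link_ratio` near 1, while story paragraphs yield long blocks with a `link_ratio` near 0, which is the kind of separation a classifier could learn from.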

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
Publisher: European Language Resources Association (ELRA)
Pages: 1600-1603
Number of pages: 4
ISBN (Electronic): 2951740840, 9782951740846
Publication status: Published - 1 Jan 2008
Externally published: Yes
Event: 6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 28 May 2008 - 30 May 2008

Other

Other: 6th International Conference on Language Resources and Evaluation, LREC 2008
Country: Morocco
City: Marrakech
Period: 28/5/08 - 30/5/08

ASJC Scopus subject areas

  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics
  • Education

Cite this

Ibrahim, H., Darwish, K., & Abdel-Sabor, A. R. (2008). Automatic extraction of textual elements from news web pages. In Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008 (pp. 1600-1603). European Language Resources Association (ELRA).