A comprehensive study of features and algorithms for URL-based topic classification

Eda Baykan, Monika Henzinger, Ludmila Marian, Ingmar Weber

Research output: Contribution to journal › Article

36 Citations (Scopus)

Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.

Original language: English
Article number: 15
Journal: ACM Transactions on the Web
Volume: 5
Issue number: 3
DOI: 10.1145/1993053.1993057
Publication status: Published - 1 Jul 2011
Externally published: Yes

Fingerprint

  • Websites
  • Classifiers
  • Experiments

Keywords

  • ODP
  • Topic classification
  • URL

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

A comprehensive study of features and algorithms for URL-based topic classification. / Baykan, Eda; Henzinger, Monika; Marian, Ludmila; Weber, Ingmar.

In: ACM Transactions on the Web, Vol. 5, No. 3, 15, 01.07.2011.

@article{c3dfc6ce1b534f7183fd1bdc7c30d096,
title = "A comprehensive study of features and algorithms for URL-based topic classification",
abstract = "Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.",
keywords = "ODP, Topic classification, URL",
author = "Eda Baykan and Monika Henzinger and Ludmila Marian and Ingmar Weber",
year = "2011",
month = "7",
day = "1",
doi = "10.1145/1993053.1993057",
language = "English",
volume = "5",
journal = "ACM Transactions on the Web",
issn = "1559-1131",
publisher = "Association for Computing Machinery (ACM)",
number = "3",

}

TY  - JOUR
T1  - A comprehensive study of features and algorithms for URL-based topic classification
AU  - Baykan, Eda
AU  - Henzinger, Monika
AU  - Marian, Ludmila
AU  - Weber, Ingmar
PY  - 2011/7/1
Y1  - 2011/7/1
N2  - Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.
AB  - Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.
KW  - ODP
KW  - Topic classification
KW  - URL
UR  - http://www.scopus.com/inward/record.url?scp=80051944589&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=80051944589&partnerID=8YFLogxK
U2  - 10.1145/1993053.1993057
DO  - 10.1145/1993053.1993057
M3  - Article
VL  - 5
JO  - ACM Transactions on the Web
JF  - ACM Transactions on the Web
SN  - 1559-1131
IS  - 3
M1  - 15
ER  -