A comprehensive study of techniques for URL-based web page language classification

Eda Baykan, Monika Henzinger, Ingmar Weber

Research output: Contribution to journalArticle

11 Citations (Scopus)

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers. We also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract and in the second page downloading pages of the "wrong" language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.

Original languageEnglish
Article number3
JournalACM Transactions on the Web
Volume7
Issue number1
DOIs
Publication statusPublished - 1 Mar 2013
Externally publishedYes

Fingerprint

Websites
Classifiers
Bandwidth
Search engines
Learning algorithms
Learning systems
Servers

Keywords

  • Document and text processing
  • Language classification
  • URL
  • Web page classification

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

A comprehensive study of techniques for URL-based web page language classification. / Baykan, Eda; Henzinger, Monika; Weber, Ingmar.

In: ACM Transactions on the Web, Vol. 7, No. 1, 3, 01.03.2013.

Research output: Contribution to journalArticle

@article{9f583c53c0db4e3bbbe9afe0987a4af2,
title = "A comprehensive study of techniques for URL-based web page language classification",
abstract = "Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers. We also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract and in the second page downloading pages of the {"}wrong{"} language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.",
keywords = "Document and text processing, Language classification, URL, Web page classification",
author = "Eda Baykan and Monika Henzinger and Ingmar Weber",
year = "2013",
month = "3",
day = "1",
doi = "10.1145/2435215.2435218",
language = "English",
volume = "7",
journal = "ACM Transactions on the Web",
issn = "1559-1131",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - A comprehensive study of techniques for URL-based web page language classification

AU - Baykan, Eda

AU - Henzinger, Monika

AU - Weber, Ingmar

PY - 2013/3/1

Y1 - 2013/3/1

N2 - Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers. We also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract and in the second page downloading pages of the "wrong" language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.

AB - Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers. We also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract and in the second page downloading pages of the "wrong" language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.

KW - Document and text processing

KW - Language classification

KW - URL

KW - Web page classification

UR - http://www.scopus.com/inward/record.url?scp=84880239938&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880239938&partnerID=8YFLogxK

U2 - 10.1145/2435215.2435218

DO - 10.1145/2435215.2435218

M3 - Article

AN - SCOPUS:84880239938

VL - 7

JO - ACM Transactions on the Web

JF - ACM Transactions on the Web

SN - 1559-1131

IS - 1

M1 - 3

ER -