Web page language identification based on URLs

Eda Baykan, Monika Henzinger, Ingmar Weber

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

15 Citations (Scopus)

Abstract

Given only the URL of a web page, can we identify its language? This is the question we examine in this paper. Such a language classifier is useful, for example, for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of an uncrawled web page, a crawler has to download the page, which is wasteful if the page is not in the desired language. With URL-based language classifiers, these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 both for a random sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results we even achieve an F-measure of .96. The recall achieved for these collections is .93, .88 and .95, respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr, yields good precision, but a typical recall below .60 and an F-measure of around .68.
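The country-code baseline and the reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the `CCTLD_TO_LANG` table, the function names, and the example URLs are assumptions made for illustration, and the abstract does not say which top-level domains the paper maps to which language.

```python
from urllib.parse import urlsplit

# Hypothetical ccTLD-to-language table (an assumption, not taken from the paper).
CCTLD_TO_LANG = {
    "de": "German",
    "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_language(url):
    """Guess the page language from the URL's country-code TLD, or return None."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANG.get(tld)


def precision_recall_f(gold, predicted, lang):
    """Precision, recall and F-measure (harmonic mean) for one language."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == lang and p == lang)
    fp = sum(1 for g, p in pairs if g != lang and p == lang)
    fn = sum(1 for g, p in pairs if g == lang and p != lang)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data (invented): the second URL is a German page on a .com domain.
urls = [
    "http://www.example.de/",
    "http://example.com/de/seite",
    "http://www.example.fr/",
]
gold = ["German", "German", "French"]
predicted = [cctld_language(u) for u in urls]
p, r, f = precision_recall_f(gold, predicted, "German")
```

On this toy data the baseline never mislabels a page as German but misses the German page hosted on a .com domain. That is the same qualitative pattern the abstract reports for top-level domains alone: good precision but limited recall.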

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 176-187
Number of pages: 12
Volume: 1
Edition: 1
Publication status: Published - 2008
Externally published: Yes

Fingerprint

Websites
Search engines
Classifiers
World Wide Web
Learning algorithms
Learning systems
Identification (control systems)
Experiments

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Baykan, E., Henzinger, M., & Weber, I. (2008). Web page language identification based on URLs. In Proceedings of the VLDB Endowment (1 ed., Vol. 1, pp. 176-187).

@inbook{0ea66ca87e3a490a9d08e8fea1ff100f,
title = "Web page language identification based on URLs",
abstract = "Given only the URL of a web page, can we identify its language? This is the question we examine in this paper. Such a language classifier is useful, for example, for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of an uncrawled web page, a crawler has to download the page, which is wasteful if the page is not in the desired language. With URL-based language classifiers, these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 both for a random sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results we even achieve an F-measure of .96. The recall achieved for these collections is .93, .88 and .95, respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr, yields good precision, but a typical recall below .60 and an F-measure of around .68.",
author = "Eda Baykan and Monika Henzinger and Ingmar Weber",
year = "2008",
language = "English",
volume = "1",
pages = "176--187",
booktitle = "Proceedings of the VLDB Endowment",
edition = "1",

}

TY - CHAP

T1 - Web page language identification based on URLs

AU - Baykan, Eda

AU - Henzinger, Monika

AU - Weber, Ingmar

PY - 2008

Y1 - 2008

AB - Given only the URL of a web page, can we identify its language? This is the question we examine in this paper. Such a language classifier is useful, for example, for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of an uncrawled web page, a crawler has to download the page, which is wasteful if the page is not in the desired language. With URL-based language classifiers, these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 both for a random sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results we even achieve an F-measure of .96. The recall achieved for these collections is .93, .88 and .95, respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr, yields good precision, but a typical recall below .60 and an F-measure of around .68.

UR - http://www.scopus.com/inward/record.url?scp=78149404935&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78149404935&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:78149404935

VL - 1

SP - 176

EP - 187

BT - Proceedings of the VLDB Endowment

ER -