A study of using search engine page hits as a proxy for n-gram frequencies

Preslav Nakov, Marti Hearst

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Citations (Scopus)

Abstract

The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.

Original language: English
Title of host publication: International Conference Recent Advances in Natural Language Processing, RANLP
Publisher: Association for Computational Linguistics (ACL)
Pages: 347-353
Number of pages: 7
Volume: 2005-January
ISBN (Print): 9549174336
Publication status: Published - 2005
Externally published: Yes
Event: International Conference on Recent Advances in Natural Language Processing, RANLP 2005 - Borovets, Bulgaria
Duration: 21 Sep 2005 to 23 Sep 2005

Other

Other: International Conference on Recent Advances in Natural Language Processing, RANLP 2005
Country: Bulgaria
City: Borovets
Period: 21/9/05 to 23/9/05

Fingerprint

Search engines
Linguistics

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Software
  • Electrical and Electronic Engineering

Cite this

Nakov, P., & Hearst, M. (2005). A study of using search engine page hits as a proxy for n-gram frequencies. In International Conference Recent Advances in Natural Language Processing, RANLP (Vol. 2005-January, pp. 347-353). Association for Computational Linguistics (ACL).

A study of using search engine page hits as a proxy for n-gram frequencies. / Nakov, Preslav; Hearst, Marti.

International Conference Recent Advances in Natural Language Processing, RANLP. Vol. 2005-January Association for Computational Linguistics (ACL), 2005. p. 347-353.

Nakov, P & Hearst, M 2005, A study of using search engine page hits as a proxy for n-gram frequencies. in International Conference Recent Advances in Natural Language Processing, RANLP. vol. 2005-January, Association for Computational Linguistics (ACL), pp. 347-353, International Conference on Recent Advances in Natural Language Processing, RANLP 2005, Borovets, Bulgaria, 21/9/05.
Nakov P, Hearst M. A study of using search engine page hits as a proxy for n-gram frequencies. In International Conference Recent Advances in Natural Language Processing, RANLP. Vol. 2005-January. Association for Computational Linguistics (ACL). 2005. p. 347-353
Nakov, Preslav ; Hearst, Marti. / A study of using search engine page hits as a proxy for n-gram frequencies. International Conference Recent Advances in Natural Language Processing, RANLP. Vol. 2005-January Association for Computational Linguistics (ACL), 2005. pp. 347-353
@inproceedings{3972702df59d4bef904240ecccc440ac,
title = "A study of using search engine page hits as a proxy for n-gram frequencies",
abstract = "The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.",
author = "Preslav Nakov and Marti Hearst",
year = "2005",
language = "English",
isbn = "9549174336",
volume = "2005-January",
pages = "347--353",
booktitle = "International Conference Recent Advances in Natural Language Processing, RANLP",
publisher = "Association for Computational Linguistics (ACL)",
}

TY  - GEN
T1  - A study of using search engine page hits as a proxy for n-gram frequencies
AU  - Nakov, Preslav
AU  - Hearst, Marti
PY  - 2005
Y1  - 2005
N2  - The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.
AB  - The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.
UR  - http://www.scopus.com/inward/record.url?scp=84962711699&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84962711699&partnerID=8YFLogxK
M3  - Conference contribution
SN  - 9549174336
VL  - 2005-January
SP  - 347
EP  - 353
BT  - International Conference Recent Advances in Natural Language Processing, RANLP
PB  - Association for Computational Linguistics (ACL)
ER  -