A study of using search engine page hits as a proxy for n-gram frequencies

Preslav Nakov, Marti Hearst

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Citations (Scopus)

Abstract

The idea of using the Web as a corpus for linguistic research is becoming increasingly popular. Most often this means using Web search engine page hit counts as estimates of n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in n-gram counts across different search engines, as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significant for the task examined.

Original language: English
Title of host publication: International Conference Recent Advances in Natural Language Processing, RANLP
Publisher: Association for Computational Linguistics (ACL)
Pages: 347-353
Number of pages: 7
Volume: 2005-January
ISBN (Print): 9549174336
Publication status: Published - 2005
Externally published: Yes
Event: International Conference on Recent Advances in Natural Language Processing, RANLP 2005 - Borovets, Bulgaria
Duration: 21 Sep 2005 - 23 Sep 2005



ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Software
  • Electrical and Electronic Engineering

Cite this

Nakov, P., & Hearst, M. (2005). A study of using search engine page hits as a proxy for n-gram frequencies. In International Conference Recent Advances in Natural Language Processing, RANLP (Vol. 2005-January, pp. 347-353). Association for Computational Linguistics (ACL).