Web as a Corpus: Going beyond the n-gram

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on subproblems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should be only used as a baseline. We show that much better results are possible for structural ambiguity problems, when going beyond the n-gram.

Original language: English
Title of host publication: Communications in Computer and Information Science
Publisher: Springer Verlag
Pages: 185-228
Number of pages: 44
Volume: 505
ISBN (Print): 9783319254845
DOIs: https://doi.org/10.1007/978-3-319-25485-2_5
Publication status: Published - 2015
Event: 8th Russian Summer School on Information Retrieval, RuSSIR 2014 - Nizhniy Novgorod, Russian Federation
Duration: 18 Aug 2014 to 22 Aug 2014

Publication series

Name: Communications in Computer and Information Science
Volume: 505
ISSN (Print): 1865-0929

Other

Other: 8th Russian Summer School on Information Retrieval, RuSSIR 2014
Country: Russian Federation
City: Nizhniy Novgorod
Period: 18/8/14 to 22/8/14

Keywords

  • Noun compound bracketing
  • Noun phrase coordination
  • Paraphrases
  • Prepositional phrase attachment
  • Surface features
  • Syntactic parsing
  • Web as a corpus

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Nakov, P. (2015). Web as a Corpus: Going beyond the n-gram. In Communications in Computer and Information Science (Vol. 505, pp. 185-228). (Communications in Computer and Information Science; Vol. 505). Springer Verlag. https://doi.org/10.1007/978-3-319-25485-2_5