Web as a Corpus: Going beyond the n-gram

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on subproblems. Even so, the field was stuck with manual rules until the early 1990s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.
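As an illustration of the n-gram-count approach the abstract refers to, the sketch below shows the classic adjacency model for noun compound bracketing, one of the structural ambiguity problems listed in the keywords. This is not code from the chapter: the counts are hypothetical stand-ins for the Web-scale (or page-hit) frequencies it discusses.

```python
# Illustrative sketch: adjacency model for bracketing a three-noun compound
# (w1 w2 w3), using a count function that would, in the Web-as-a-corpus
# setting, return Web-derived n-gram frequencies or page hit counts.

def bracket(w1, w2, w3, count):
    """Return a left or right bracketing for the compound (w1 w2 w3).

    Adjacency model: compare how strongly w1 associates with w2
    against how strongly w2 associates with w3.
    """
    left = count(w1, w2)   # strength of the (w1 w2) pairing
    right = count(w2, w3)  # strength of the (w2 w3) pairing
    if left >= right:
        return "[[{} {}] {}]".format(w1, w2, w3)  # left bracketing
    return "[{} [{} {}]]".format(w1, w2, w3)      # right bracketing

# Hypothetical bigram counts standing in for Web frequencies:
counts = {("liver", "cell"): 1200, ("cell", "antibody"): 300}
print(bracket("liver", "cell", "antibody",
              lambda a, b: counts.get((a, b), 0)))
# -> [[liver cell] antibody]
```

The chapter's point is that such raw counts are only a starting point; richer signals (surface features, paraphrases) go "beyond the n-gram".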

Original language: English
Title of host publication: Communications in Computer and Information Science
Publisher: Springer Verlag
Pages: 185-228
Number of pages: 44
Volume: 505
ISBN (Print): 9783319254845
DOI: 10.1007/978-3-319-25485-2_5
Publication status: Published - 2015
Event: 8th Russian Summer School on Information Retrieval, RuSSIR 2014 - Nizhniy Novgorod, Russian Federation
Duration: 18 Aug 2014 - 22 Aug 2014

Publication series

Name: Communications in Computer and Information Science
Volume: 505
ISSN (Print): 1865-0929

Keywords

  • Noun compound bracketing
  • Noun phrase coordination
  • Paraphrases
  • Prepositional phrase attachment
  • Surface features
  • Syntactic parsing
  • Web as a corpus

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Nakov, P. (2015). Web as a Corpus: Going beyond the n-gram. In Communications in Computer and Information Science (Vol. 505, pp. 185-228). (Communications in Computer and Information Science; Vol. 505). Springer Verlag. https://doi.org/10.1007/978-3-319-25485-2_5

@inproceedings{411f4d9d3e4a4c97b444d567051ca5e0,
title = "Web as a Corpus: Going beyond the n-gram",
abstract = "The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on subproblems. Even so, the field was stuck with manual rules until the early 1990s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.",
keywords = "Noun compound bracketing, Noun phrase coordination, Paraphrases, Prepositional phrase attachment, Surface features, Syntactic parsing, Web as a corpus",
author = "Preslav Nakov",
year = "2015",
doi = "10.1007/978-3-319-25485-2_5",
language = "English",
isbn = "9783319254845",
volume = "505",
series = "Communications in Computer and Information Science",
publisher = "Springer Verlag",
pages = "185--228",
booktitle = "Communications in Computer and Information Science",

}

TY - GEN

T1 - Web as a Corpus

T2 - Going beyond the n-gram

AU - Nakov, Preslav

PY - 2015

Y1 - 2015

N2 - The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on subproblems. Even so, the field was stuck with manual rules until the early 1990s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.

KW - Noun compound bracketing

KW - Noun phrase coordination

KW - Paraphrases

KW - Prepositional phrase attachment

KW - Surface features

KW - Syntactic parsing

KW - Web as a corpus

UR - http://www.scopus.com/inward/record.url?scp=84951790321&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84951790321&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-25485-2_5

DO - 10.1007/978-3-319-25485-2_5

M3 - Conference contribution

AN - SCOPUS:84951790321

SN - 9783319254845

VL - 505

T3 - Communications in Computer and Information Science

SP - 185

EP - 228

BT - Communications in Computer and Information Science

PB - Springer Verlag

ER -