Improved word alignments using the Web as a corpus

Preslav Nakov, Svetlin Nakov, Elena Paskaleva

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, linguistically motivated weighted minimum edit distance, competitive linking, and the IBM models. Evaluation results on a Bulgarian-Russian corpus show a sizable improvement both in word alignment and in translation quality.
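
The abstract names the vector space model and competitive linking without giving details. As a rough illustration only, and not the authors' implementation, the sketch below shows one common reading of the two ideas: words are represented by sparse context vectors and compared with cosine similarity, and candidate word pairs are then linked greedily, best score first, with each word used at most once. The function names (`cosine`, `competitive_linking`), the `threshold` parameter, and the dictionary-based data layout are all assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse context vectors (hypothetical helper)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def competitive_linking(
    scores: Dict[Tuple[str, str], float],
    threshold: float = 0.0,  # assumed cut-off; not taken from the paper
) -> List[Tuple[str, str]]:
    """Greedy one-to-one linking: take the best-scoring pair, retire both words, repeat."""
    links: List[Tuple[str, str]] = []
    used_src, used_tgt = set(), set()
    # Highest score first; ties are broken by the pair itself for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    for (src, tgt), score in ranked:
        if score <= threshold:
            break  # all remaining pairs score at or below the cut-off
        if src in used_src or tgt in used_tgt:
            continue  # one of the two words is already linked
        links.append((src, tgt))
        used_src.add(src)
        used_tgt.add(tgt)
    return links
```

With invented toy scores such as `{("s1", "t1"): 0.9, ("s1", "t2"): 0.8, ("s2", "t1"): 0.7, ("s2", "t2"): 0.1}`, the pair `("s1", "t1")` is linked first, which blocks the two middle candidates and leaves `("s2", "t2")` as the only remaining option. In practice the scores would come from a similarity measure such as `cosine` applied to the context vectors of the two words.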

Original language: English
Title of host publication: International Conference Recent Advances in Natural Language Processing, RANLP 2007 - Proceedings
Editors: Nicolas Nicolov, Nikolai Nikolov, Ruslan Mitkov, Kalina Bontcheva, Galia Angelova
Publisher: Association for Computational Linguistics (ACL)
Pages: 400-405
Number of pages: 6
ISBN (Electronic): 9789549174373
Publication status: Published - 1 Jan 2007
Event: International Conference Recent Advances in Natural Language Processing, RANLP 2007 - Borovets, Bulgaria
Duration: 27 Sep 2007 - 29 Sep 2007

Publication series

Name: International Conference Recent Advances in Natural Language Processing, RANLP
Volume: 2007-January
ISSN (Print): 1313-8502

Other

Other: International Conference Recent Advances in Natural Language Processing, RANLP 2007
Country: Bulgaria
City: Borovets
Period: 27/9/07 - 29/9/07

Keywords

  • Competitive linking
  • Edit distance
  • Machine translation
  • String similarity
  • Web as a corpus
  • Word alignments

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Artificial Intelligence
  • Electrical and Electronic Engineering

Cite this

Nakov, P., Nakov, S., & Paskaleva, E. (2007). Improved word alignments using the Web as a corpus. In N. Nicolov, N. Nikolov, R. Mitkov, K. Bontcheva, & G. Angelova (Eds.), International Conference Recent Advances in Natural Language Processing, RANLP 2007 - Proceedings (pp. 400-405). (International Conference Recent Advances in Natural Language Processing, RANLP; Vol. 2007-January). Association for Computational Linguistics (ACL).