Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase

Translated title of the contribution: Extracting parallel corpora from Wikipedia on the basis of phrase-level bilingual alignment

Joan Albert Silvestre-Cerdà, Mercedes García-Martínez, Alberto Barrón-Cedeño, Jorge Civera, Paolo Rosso

Research output: Contribution to journal › Conference article


This paper presents a proposal for extracting parallel corpora from Wikipedia using statistical machine translation techniques. We use IBM word-level alignment models to obtain phrase-level bilingual alignments between document pairs. To evaluate the model, we manually annotated a test set of comparable English-Spanish documents. The results obtained are encouraging.
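As a rough illustration of how a word-level alignment model can score candidate sentence pairs, the sketch below trains a toy IBM Model 1 with EM and ranks pairs by average log-likelihood. The abstract does not say which IBM model, normalization, or threshold the authors used, so everything here is an assumption: the function names `train_ibm1` and `avg_log_likelihood`, the five-pair toy corpus, the omission of the NULL word, and the probability floor are all hypothetical choices made for brevity.

```python
import math
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with EM (IBM Model 1, no NULL word).

    pairs: list of (source_tokens, target_tokens) tuples.
    """
    src_vocab = {f for src, _ in pairs for f in src}
    # Uniform initialization over every co-occurring word pair.
    t = {(f, e): 1.0 / len(src_vocab)
         for src, tgt in pairs for f in src for e in tgt}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        for src, tgt in pairs:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)  # normalizer for this f
                for e in tgt:
                    c = t[(f, e)] / z            # posterior of link f-e
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts per target word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def avg_log_likelihood(src, tgt, t, floor=1e-9):
    """Length-normalized Model 1 log-likelihood of src given tgt.

    Unseen word pairs contribute probability 0; the floor (an assumed
    smoothing constant) keeps the logarithm finite.
    """
    ll = 0.0
    for f in src:
        p = sum(t.get((f, e), 0.0) for e in tgt) / len(tgt)
        ll += math.log(max(p, floor))
    return ll / len(src)

# Toy Spanish-English training data (hypothetical, for illustration only).
pairs = [
    (["la", "casa"], ["the", "house"]),
    (["la", "casa", "verde"], ["the", "green", "house"]),
    (["la", "flor"], ["the", "flower"]),
    (["el", "libro"], ["the", "book"]),
    (["un", "libro", "verde"], ["a", "green", "book"]),
]
t = train_ibm1(pairs)

parallel = avg_log_likelihood(["la", "casa"], ["the", "house"], t)
unrelated = avg_log_likelihood(["la", "casa"], ["a", "book"], t)
print(parallel, unrelated)
```

In a setting like the one described, each sentence of one document would be scored against the sentences of its counterpart and the pairs above some threshold kept as parallel; how that threshold is chosen is not stated in the abstract.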

Original language: Spanish
Pages (from-to): 14-21
Number of pages: 8
Journal: CEUR Workshop Proceedings
Publication status: Published - 1 Dec 2011
Event: Workshop on Iberian Cross-Language Natural Language Processing Tasks, ICL 2011 - Huelva, Spain
Duration: 7 Sep 2011 → 7 Sep 2011


Keywords

  • Comparable corpora
  • Parallel sentences extraction
  • Statistical machine translation

ASJC Scopus subject areas

  • Computer Science (all)
