Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase

Translated title of the contribution: Extracting parallel corpora from Wikipedia on the basis of phrase-level bilingual alignment

Joan Albert Silvestre-Cerdà, Mercedes García-Martínez, Alberto Barron, Jorge Civera, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents a proposal for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. We use IBM word-level alignment models to obtain phrase-level bilingual alignments between document pairs. To evaluate the model, we manually annotated a test set of comparable English-Spanish documents. The results obtained are encouraging.
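The abstract gives no implementation details, so the following is only a rough illustration of the general idea and not the authors' system. It is a minimal sketch, assuming IBM Model 1 as the word-level alignment model and a length-normalised log-likelihood as the score for a candidate pair. The function names (`train_ibm1`, `score_pair`), the toy corpus, and the `floor` value are all hypothetical.

```python
from collections import defaultdict
import math

NULL = "<NULL>"  # empty target word that can generate unaligned source words


def train_ibm1(pairs, iterations=10):
    """Estimate lexical probabilities t(f | e) with EM (IBM Model 1).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    """
    src_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialisation over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        # E-step: distribute each source word over the target words.
        for fs, es in pairs:
            es_null = [NULL] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def score_pair(t, fs, es, floor=1e-9):
    """Length-normalised log P(fs | es) under Model 1.

    Word pairs never seen together in training get the small
    probability `floor` (an assumption made for this sketch).
    """
    es_null = [NULL] + es
    logp = 0.0
    for f in fs:
        p = sum(t.get((f, e), floor) for e in es_null) / len(es_null)
        logp += math.log(p)
    return logp / len(fs)


# Hypothetical toy Spanish-English training corpus.
pairs = [
    ("la casa".split(), "the house".split()),
    ("la casa verde".split(), "the green house".split()),
    ("el libro".split(), "the book".split()),
    ("un libro verde".split(), "a green book".split()),
]
t = train_ibm1(pairs)

good = score_pair(t, "la casa".split(), "the house".split())
bad = score_pair(t, "la casa".split(), "a book".split())
print(good, bad)
```

In a setting like the one described, candidate pairs drawn from two comparable documents could be ranked by such a score and the best-scoring ones kept as parallel; any threshold would have to be tuned on annotated data.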

Original language: Spanish
Title of host publication: CEUR Workshop Proceedings
Pages: 14-21
Number of pages: 8
Volume: 824
Publication status: Published - 2011
Externally published: Yes
Event: Workshop on Iberian Cross-Language Natural Language Processing Tasks, ICL 2011 - Huelva, Spain
Duration: 7 Sep 2011 → 7 Sep 2011

Other

Country: Spain
City: Huelva
Period: 7/9/11 → 7/9/11

Keywords

  • Comparable corpora
  • Parallel sentences extraction
  • Statistical machine translation

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Silvestre-Cerdà, J. A., García-Martínez, M., Barron, A., Civera, J., & Rosso, P. (2011). Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase. In CEUR Workshop Proceedings (Vol. 824, pp. 14-21).

@inproceedings{8c9759b2f16941b8a08b999f4c70db51,
title = "Extracci{\'o}n de corpus paralelos de la Wikipedia basada en la obtenci{\'o}n de alineamientos biling{\"u}es a nivel de frase",
abstract = "This paper presents a proposal for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. We use IBM word-level alignment models to obtain phrase-level bilingual alignments between document pairs. To evaluate the model, we manually annotated a test set of comparable English-Spanish documents. The results obtained are encouraging.",
keywords = "Comparable corpora, Parallel sentences extraction, Statistical machine translation",
author = "Silvestre-Cerd{\`a}, {Joan Albert} and Mercedes Garc{\'i}a-Mart{\'i}nez and Alberto Barron and Jorge Civera and Paolo Rosso",
year = "2011",
language = "Spanish",
volume = "824",
pages = "14--21",
booktitle = "CEUR Workshop Proceedings",

}

TY - GEN

T1 - Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase

AU - Silvestre-Cerdà, Joan Albert

AU - García-Martínez, Mercedes

AU - Barron, Alberto

AU - Civera, Jorge

AU - Rosso, Paolo

PY - 2011

Y1 - 2011

N2 - This paper presents a proposal for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. We use IBM word-level alignment models to obtain phrase-level bilingual alignments between document pairs. To evaluate the model, we manually annotated a test set of comparable English-Spanish documents. The results obtained are encouraging.

KW - Comparable corpora

KW - Parallel sentences extraction

KW - Statistical machine translation

UR - http://www.scopus.com/inward/record.url?scp=84891953782&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84891953782&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84891953782

VL - 824

SP - 14

EP - 21

BT - CEUR Workshop Proceedings

ER -