A hybrid morpheme-word representation for machine translation of morphologically rich languages

Minh Thang Luong, Preslav Nakov, Min Yen Kan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.
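To make the hybrid representation concrete, here is a minimal Python sketch of converting between word and morpheme sequences while keeping word boundaries recoverable. It rests on assumptions not stated in the abstract: the `+` continuation marker, the toy segmentation table, and the function names `to_morphemes` and `to_words` are all hypothetical, and a real system would use a morphological segmenter. The sketch only illustrates why a morpheme-level output can still be scored with a word-level metric such as BLEU: the morpheme sequence maps back to words deterministically.

```python
from typing import Dict, List

# Hypothetical toy segmentation table (illustration only); a real system
# would obtain segmentations from a morphological segmenter.
TOY_SEGMENTATION: Dict[str, List[str]] = {
    "taloissa": ["talo", "i", "ssa"],
    "autossa": ["auto", "ssa"],
}

# Assumed convention: a trailing "+" means "the word continues with the
# next morpheme"; a morpheme without it closes the current word.
CONT = "+"


def to_morphemes(words: List[str]) -> List[str]:
    """Split words into morphemes, marking every non-final morpheme."""
    morphemes: List[str] = []
    for word in words:
        parts = TOY_SEGMENTATION.get(word, [word])
        morphemes.extend(part + CONT for part in parts[:-1])
        morphemes.append(parts[-1])
    return morphemes


def to_words(morphemes: List[str]) -> List[str]:
    """Rejoin a morpheme sequence into words using the continuation marker."""
    words: List[str] = []
    current = ""
    for morpheme in morphemes:
        if morpheme.endswith(CONT):
            current += morpheme[: -len(CONT)]
        else:
            words.append(current + morpheme)
            current = ""
    if current:
        # A dangling continuation at the end is emitted as its own word.
        words.append(current)
    return words


sentence = ["taloissa", "on", "autossa"]
morphs = to_morphemes(sentence)
restored = to_words(morphs)
```

In this sketch, a tuning loop could translate at the morpheme level, call `to_words` on each hypothesis, and compute BLEU on the restored words; that use is an illustration of the idea in the abstract, not the authors' implementation.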

Original language: English
Title of host publication: EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 148-157
Number of pages: 10
Publication status: Published - 1 Dec 2010
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2010 - Cambridge, MA, United States
Duration: 9 Oct 2010 to 11 Oct 2010

Publication series

Name: EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Other

Other: Conference on Empirical Methods in Natural Language Processing, EMNLP 2010
Country: United States
City: Cambridge, MA
Period: 9/10/10 to 11/10/10

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Luong, M. T., Nakov, P., & Kan, M. Y. (2010). A hybrid morpheme-word representation for machine translation of morphologically rich languages. In EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 148-157).