A hybrid morpheme-word representation for machine translation of morphologically rich languages

Minh Thang Luong, Preslav Nakov, Min Yen Kan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)

Abstract

We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.
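The abstract does not spell out how word boundaries are kept recoverable once the text is split into morphemes, so the following is only a minimal sketch of one way to do it, not the authors' implementation. It assumes a hypothetical convention in which a trailing "+" marks a morpheme that is continued by another morpheme of the same word. The function names (`to_morpheme_tokens`, `to_words`, `words_spanned`) and the hand-made Finnish segmentation are illustrative assumptions. The sketch shows how a morpheme sequence can be mapped back to words (as needed for word-level BLEU or a word-level language model) and how one can count the words a morpheme span touches (as a boundary-aware phrase extractor might).

```python
from typing import List

# Assumption for this sketch only: a trailing "+" marks a morpheme that is
# followed by at least one more morpheme of the same word.
CONT = "+"


def to_morpheme_tokens(segmented_words: List[List[str]]) -> List[str]:
    """Flatten per-word morpheme lists into one token stream with markers."""
    tokens: List[str] = []
    for morphemes in segmented_words:
        for i, morpheme in enumerate(morphemes):
            is_last = i == len(morphemes) - 1
            tokens.append(morpheme if is_last else morpheme + CONT)
    return tokens


def to_words(tokens: List[str]) -> List[str]:
    """Rebuild words by concatenating morphemes up to each word-final token."""
    words: List[str] = []
    current = ""
    for token in tokens:
        if token.endswith(CONT):
            current += token[: -len(CONT)]
        else:
            words.append(current + token)
            current = ""
    if current:  # a dangling, unfinished word at the end of the stream
        words.append(current)
    return words


def words_spanned(tokens: List[str], start: int, end: int) -> int:
    """Count the words touched by the morpheme span tokens[start:end]."""
    span = tokens[start:end]
    if not span:
        return 0
    finished = sum(1 for t in span if not t.endswith(CONT))
    # A span ending on a continued morpheme touches one more, partial, word.
    return finished + (1 if span[-1].endswith(CONT) else 0)


# Hand-segmented example: "talossani on" ("in my house is").
sentence = [["talo", "ssa", "ni"], ["on"]]
tokens = to_morpheme_tokens(sentence)
print(tokens)                        # ['talo+', 'ssa+', 'ni', 'on']
print(to_words(tokens))              # ['talossani', 'on']
print(words_spanned(tokens, 0, 3))   # 1
print(words_spanned(tokens, 1, 4))   # 2
```

Because the marker travels with each morpheme token, the word sequence can be rebuilt at any stage without a separate alignment, which is the property a hybrid representation of this kind relies on.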

Original language: English
Title of host publication: EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 148-157
Number of pages: 10
Publication status: Published - 1 Dec 2010
Externally published: Yes
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2010 - Cambridge, MA, United States
Duration: 9 Oct 2010 to 11 Oct 2010

Other

Other: Conference on Empirical Methods in Natural Language Processing, EMNLP 2010
Country: United States
City: Cambridge, MA
Period: 9/10/10 to 11/10/10

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Luong, M. T., Nakov, P., & Kan, M. Y. (2010). A hybrid morpheme-word representation for machine translation of morphologically rich languages. In EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 148-157).

@inproceedings{bdf6dd83a97547a287cec06bc781346f,
title = "A hybrid morpheme-word representation for machine translation of morphologically rich languages",
author = "Luong, {Minh Thang} and Nakov, Preslav and Kan, {Min Yen}",
year = "2010",
month = "12",
day = "1",
language = "English",
isbn = "1932432868",
pages = "148--157",
booktitle = "EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",

}

TY - GEN

T1 - A hybrid morpheme-word representation for machine translation of morphologically rich languages

AU - Luong, Minh Thang

AU - Nakov, Preslav

AU - Kan, Min Yen

PY - 2010/12/1

Y1 - 2010/12/1

UR - http://www.scopus.com/inward/record.url?scp=80053240423&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053240423&partnerID=8YFLogxK

M3 - Conference contribution

SN - 1932432868

SN - 9781932432862

SP - 148

EP - 157

BT - EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

ER -