Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish

Reyyan Yeniterzi, Kemal Oflazer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

28 Citations (Scopus)

Abstract

We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39% relative improvement from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long-distance constituent reordering to bring the English constituent order closer to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.
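The headline figure in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: the function name `relative_improvement` and the variable names are assumptions introduced here, and the two BLEU scores are the averaged values quoted in the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Averaged BLEU scores as quoted in the abstract.
baseline_bleu = 17.08
best_bleu = 23.78

absolute_gain = best_bleu - baseline_bleu
relative_gain = relative_improvement(baseline_bleu, best_bleu)

print(f"absolute gain: {absolute_gain:.2f} BLEU")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gain is 6.70 BLEU points, which is roughly 39% of the 17.08 baseline, in line with the figure reported above.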

Original language: English
Title of host publication: ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Pages: 454-464
Number of pages: 11
Publication status: Published - 2010
Event: 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010 - Uppsala, Sweden
Duration: 11 Jul 2010 to 16 Jul 2010

Other

Other: 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010
Country: Sweden
City: Uppsala
Period: 11/7/10 to 16/7/10

Fingerprint

Tag
Syntax
Statistical Machine Translation
Syntactic Analysis
Constituent
Morphological Analysis
Disambiguation
Morphological Structure
Experiment
Syntactic Structure
Morpheme
Constituent Order
Language

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Yeniterzi, R., & Oflazer, K. (2010). Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish. In ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 454-464)

@inproceedings{0d5264fdc1ca47b3a740cf52e38c0a40,
title = "Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish",
abstract = "We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39{\%} relative improvement from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long-distance constituent reordering to bring the English constituent order closer to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.",
author = "Reyyan Yeniterzi and Kemal Oflazer",
year = "2010",
language = "English",
isbn = "9781617388088",
pages = "454--464",
booktitle = "ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference",
}

TY - GEN

T1 - Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish

AU - Yeniterzi, Reyyan

AU - Oflazer, Kemal

PY - 2010

Y1 - 2010

AB - We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39% relative improvement from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long-distance constituent reordering to bring the English constituent order closer to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.

UR - http://www.scopus.com/inward/record.url?scp=84859970431&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859970431&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84859970431

SN - 9781617388088

SP - 454

EP - 464

BT - ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

ER -