Analyzing the use of character-level translation with sparse and noisy datasets

Jörg Tiedemann, Preslav Nakov

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)

Abstract

This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40% and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and we find that the latter works significantly better. Finally, we demonstrate that neither word- nor character-BLEU correlate perfectly with human judgments, due to BLEU's sensitivity to length.
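As a rough illustration of what "character-level" means here, the sketch below rewrites a sentence as a sequence of single characters so that a phrase-based system can treat each character as a token. The function names and the use of `_` as a word-boundary marker are assumptions made for this example; the abstract does not say how the authors segment their data.

```python
def to_char_level(sentence: str, boundary: str = "_") -> str:
    """Rewrite a sentence as space-separated characters.

    Word boundaries are kept as a separate marker token so that the
    original spacing can be restored after translation.
    Assumption: the marker does not occur in the input text.
    """
    words = sentence.split()
    return f" {boundary} ".join(" ".join(word) for word in words)


def from_char_level(segmented: str, boundary: str = "_") -> str:
    """Invert to_char_level: glue characters back into words."""
    words = segmented.split(f" {boundary} ")
    return " ".join(word.replace(" ", "") for word in words)


if __name__ == "__main__":
    segmented = to_char_level("how are you")
    print(segmented)
    print(from_char_level(segmented))
```

With this representation, a word never seen in training can still be translated piece by piece from known character sequences, which is one plausible reading of why such models leave fewer words untranslated.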

Original language: English
Title of host publication: International Conference Recent Advances in Natural Language Processing, RANLP
Pages: 676-684
Number of pages: 9
Publication status: Published - 2013
Event: 9th International Conference on Recent Advances in Natural Language Processing, RANLP 2013 - Hissar, Bulgaria
Duration: 9 Sep 2013 to 11 Sep 2013

Other

Other: 9th International Conference on Recent Advances in Natural Language Processing, RANLP 2013
Country: Bulgaria
City: Hissar
Period: 9/9/13 to 11/9/13

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Software
  • Electrical and Electronic Engineering

Cite this

Tiedemann, J., & Nakov, P. (2013). Analyzing the use of character-level translation with sparse and noisy datasets. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 676-684)

@inproceedings{d5c0ab04fe2f45d388e456cc1650c6d8,
title = "Analyzing the use of character-level translation with sparse and noisy datasets",
abstract = "This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40{\%} and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and we find that the latter works significantly better. Finally, we demonstrate that neither word- nor character-BLEU correlate perfectly with human judgments, due to BLEU's sensitivity to length.",
author = "J{\"o}rg Tiedemann and Preslav Nakov",
year = "2013",
language = "English",
pages = "676--684",
booktitle = "International Conference Recent Advances in Natural Language Processing, RANLP",

}

TY - GEN

T1 - Analyzing the use of character-level translation with sparse and noisy datasets

AU - Tiedemann, Jörg

AU - Nakov, Preslav

PY - 2013

Y1 - 2013

N2 - This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40% and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and we find that the latter works significantly better. Finally, we demonstrate that neither word- nor character-BLEU correlate perfectly with human judgments, due to BLEU's sensitivity to length.

UR - http://www.scopus.com/inward/record.url?scp=84890512106&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84890512106&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84890512106

SP - 676

EP - 684

BT - International Conference Recent Advances in Natural Language Processing, RANLP

ER -