Bi-text alignment of movie subtitles for spoken English-Arabic statistical machine translation

Fahad Al-Obaidli, Stephen Cox, Preslav Nakov

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.
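
To make the idea of using time information concrete, the following is a minimal Python sketch of overlap-based pairing of subtitle fragments. It is not the algorithm described in the paper, nor the behaviour of subalign; it is only one simple way timestamps could be used instead of fragment lengths. The names Cue, overlap and align_by_time, and the min_ratio threshold of 0.5, are hypothetical and chosen for illustration. The sketch assumes both subtitle tracks are sorted by start time, share the same clock, and contain no overlapping cues within a single track.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle fragment (hypothetical structure for this sketch)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Return the length in seconds of the time span shared by two cues."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(src, tgt, min_ratio=0.5):
    """Pair each source cue with the target cue it overlaps most in time.

    A pair is kept only when the overlap covers at least min_ratio of the
    shorter of the two cues. min_ratio is an assumed, untuned threshold.
    """
    pairs = []
    j = 0  # index of the first target cue that may still overlap
    for s in src:
        # Skip target cues that end before this source cue starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best, best_ov = None, 0.0
        k = j
        # Scan only the target cues that start before this source cue ends.
        while k < len(tgt) and tgt[k].start < s.end:
            ov = overlap(s, tgt[k])
            if ov > best_ov:
                best, best_ov = tgt[k], ov
            k += 1
        if best is not None:
            shorter = min(s.end - s.start, best.end - best.start)
            if shorter > 0 and best_ov / shorter >= min_ratio:
                pairs.append((s.text, best.text))
    return pairs
```

A sketch like this only produces one-to-one pairs and ignores clock drift or differing frame rates between subtitle files, which a practical aligner would have to handle.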

Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers
Publisher: Springer Verlag
Pages: 127-139
Number of pages: 13
ISBN (Print): 9783319754864
DOIs: https://doi.org/10.1007/978-3-319-75487-1_11
Publication status: Published - 1 Jan 2018
Event: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016 - Konya, Turkey
Duration: 3 Apr 2016 to 9 Apr 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9624 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016
Country: Turkey
City: Konya
Period: 3 Apr 2016 to 9 Apr 2016

Fingerprint

Statistical Machine Translation
Alignment
Resources
Machine Translation
Fragment
Text
Evaluation
Language

Keywords

  • Bi-text alignment
  • Machine translation
  • Movie subtitles

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Al-Obaidli, F., Cox, S., & Nakov, P. (2018). Bi-text alignment of movie subtitles for spoken English-Arabic statistical machine translation. In Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers (pp. 127-139). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9624 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-75487-1_11

@inproceedings{7554b7b6b7894cd0a1a7c4e2b14cf4d7,
title = "Bi-text alignment of movie subtitles for spoken {English-Arabic} statistical machine translation",
abstract = "We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.",
keywords = "Bi-text alignment, Machine translation, Movie subtitles",
author = "Fahad Al-Obaidli and Stephen Cox and Preslav Nakov",
year = "2018",
month = "1",
day = "1",
doi = "10.1007/978-3-319-75487-1_11",
language = "English",
isbn = "9783319754864",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "127--139",
booktitle = "Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers",

}

TY - GEN

T1 - Bi-text alignment of movie subtitles for spoken English-Arabic statistical machine translation

AU - Al-Obaidli, Fahad

AU - Cox, Stephen

AU - Nakov, Preslav

PY - 2018/1/1

Y1 - 2018/1/1

N2 - We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.

KW - Bi-text alignment

KW - Machine translation

KW - Movie subtitles

UR - http://www.scopus.com/inward/record.url?scp=85044421443&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85044421443&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-75487-1_11

DO - 10.1007/978-3-319-75487-1_11

M3 - Conference contribution

SN - 9783319754864

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 127

EP - 139

BT - Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers

PB - Springer Verlag

ER -