The AMARA corpus: Building parallel language resources for the educational domain

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Citations (Scopus)

Abstract

This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multi-lingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validation, and preprocessing of a large collection of parallel, community-generated subtitles. Furthermore, we describe the methodology used to prepare the data for Machine Translation tasks. Additionally, we provide a document-level, jointly aligned development and test sets for 14 language pairs, designed for tuning and testing Machine Translation systems. We provide baseline results for these tasks, and highlight some of the challenges we face when building machine translation systems for educational content.
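
The figure of 190 parallel corpora is consistent with counting unordered pairs of the 20 languages: C(20, 2) = 20 × 19 / 2 = 190. A minimal sketch of that count follows, assuming one parallel corpus per unordered pair of distinct languages. The language identifiers are placeholders, not the corpus's actual language list.

```python
from itertools import combinations
from math import comb

# Placeholder identifiers: the abstract names only English, Arabic, Hindi and Thai,
# so the full list of 20 languages is not reproduced here.
languages = [f"lang{i:02d}" for i in range(1, 21)]

# Assumption: one parallel corpus per unordered pair of distinct languages.
pairs = list(combinations(languages, 2))

num_monolingual = len(languages)
num_parallel = len(pairs)

# Cross-check the enumeration against the closed form n * (n - 1) / 2.
assert num_parallel == comb(len(languages), 2)
print(num_monolingual, num_parallel)
```

Under the same assumption, treating each translation direction separately would double the count of pairs, so the abstract's number corresponds to undirected pairs.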

Original language: English
Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Publisher: European Language Resources Association (ELRA)
Pages: 1856-1862
Number of pages: 7
ISBN (Electronic): 9782951740884
Publication status: Published - 1 Jan 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: 26 May 2014 - 31 May 2014

Other

Other: 9th International Conference on Language Resources and Evaluation, LREC 2014
Country: Iceland
City: Reykjavik
Period: 26/5/14 - 31/5/14

Fingerprint

educational content
language
resources
video
Resources
Language
Education
methodology
community
Parallel Corpora
Subtitles
Machine Translation System
Methodology
Machine Translation
Testing
Tuning

Keywords

  • Crowd-sourcing
  • Educational translation
  • Lecture translation
  • Multilingual
  • Parallel corpus

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

Cite this

Abdelali, A., Guzman, F., Sajjad, H., & Vogel, S. (2014). The AMARA corpus: Building parallel language resources for the educational domain. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 1856-1862). European Language Resources Association (ELRA).

The AMARA corpus: Building parallel language resources for the educational domain. / Abdelali, Ahmed; Guzman, Francisco; Sajjad, Hassan; Vogel, Stephan.

Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014. European Language Resources Association (ELRA), 2014. p. 1856-1862.

Abdelali, A, Guzman, F, Sajjad, H & Vogel, S 2014, The AMARA corpus: Building parallel language resources for the educational domain. in Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014. European Language Resources Association (ELRA), pp. 1856-1862, 9th International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, 26/5/14.
Abdelali A, Guzman F, Sajjad H, Vogel S. The AMARA corpus: Building parallel language resources for the educational domain. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014. European Language Resources Association (ELRA). 2014. p. 1856-1862.
Abdelali, Ahmed; Guzman, Francisco; Sajjad, Hassan; Vogel, Stephan. / The AMARA corpus: Building parallel language resources for the educational domain. Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014. European Language Resources Association (ELRA), 2014. pp. 1856-1862.
@inproceedings{e8ff55afaf2642da9804c71bfd1d074c,
title = "The AMARA corpus: Building parallel language resources for the educational domain",
abstract = "This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multi-lingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validation, and preprocessing of a large collection of parallel, community-generated subtitles. Furthermore, we describe the methodology used to prepare the data for Machine Translation tasks. Additionally, we provide a document-level, jointly aligned development and test sets for 14 language pairs, designed for tuning and testing Machine Translation systems. We provide baseline results for these tasks, and highlight some of the challenges we face when building machine translation systems for educational content.",
keywords = "Crowd-sourcing, Educational translation, Lecture translation, Multilingual, Parallel corpus",
author = "Ahmed Abdelali and Francisco Guzman and Hassan Sajjad and Stephan Vogel",
year = "2014",
month = "1",
day = "1",
language = "English",
pages = "1856--1862",
booktitle = "Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014",
publisher = "European Language Resources Association (ELRA)",
}

TY - GEN
T1 - The AMARA corpus
T2 - Building parallel language resources for the educational domain
AU - Abdelali, Ahmed
AU - Guzman, Francisco
AU - Sajjad, Hassan
AU - Vogel, Stephan
PY - 2014/1/1
Y1 - 2014/1/1

AB - This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multi-lingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validation, and preprocessing of a large collection of parallel, community-generated subtitles. Furthermore, we describe the methodology used to prepare the data for Machine Translation tasks. Additionally, we provide a document-level, jointly aligned development and test sets for 14 language pairs, designed for tuning and testing Machine Translation systems. We provide baseline results for these tasks, and highlight some of the challenges we face when building machine translation systems for educational content.

KW - Crowd-sourcing
KW - Educational translation
KW - Lecture translation
KW - Multilingual
KW - Parallel corpus
UR - http://www.scopus.com/inward/record.url?scp=84959884018&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959884018&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84959884018
SP - 1856
EP - 1862
BT - Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
PB - European Language Resources Association (ELRA)
ER -