The FAUST corpus of adequacy assessments for real-world machine translation output

Daniele Pighin, Lluis Marques, Lluís Formiga

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We present a corpus of 11,292 real-world English-to-Spanish automatic translations annotated with relative (ranking) and absolute (adequate/non-adequate) quality assessments. The translation requests, collected through the popular translation portal http://reverso.net, provide a highly varied sample of real-world machine translation (MT) usage, ranging from complete sentences to units of one or two words, from well-formed to barely intelligible texts, and from technical documents to colloquial and slang snippets. In this paper, we present 1) a preliminary annotation experiment that we carried out to select the most appropriate quality criterion for these data, 2) a graph-based methodology inspired by Interactive Genetic Algorithms to reduce the annotation effort, and 3) the outcomes of the full-scale annotation experiment, which yield a valuable and original resource for the analysis and characterization of MT output quality.

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012
Publisher: European Language Resources Association (ELRA)
Pages: 29-35
Number of pages: 7
ISBN (Electronic): 9782951740877
Publication status: Published - 1 Jan 2012
Event: 8th International Conference on Language Resources and Evaluation, LREC 2012 - Istanbul, Turkey
Duration: 21 May 2012 - 27 May 2012

Other

Other: 8th International Conference on Language Resources and Evaluation, LREC 2012
Country: Turkey
City: Istanbul
Period: 21/5/12 - 27/5/12

Keywords

  • Annotated Corpus
  • Machine Translation
  • Quality Assessments

ASJC Scopus subject areas

  • Linguistics and Language
  • Language and Linguistics
  • Education
  • Library and Information Sciences

Cite this

Pighin, D., Marques, L., & Formiga, L. (2012). The FAUST corpus of adequacy assessments for real-world machine translation output. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012 (pp. 29-35). European Language Resources Association (ELRA).

@inproceedings{0a2b811d279a48a0b988f467ff93191a,
title = "The FAUST corpus of adequacy assessments for real-world machine translation output",
keywords = "Annotated Corpus, Machine Translation, Quality Assessments",
author = "Daniele Pighin and Lluis Marques and Llu{\'i}s Formiga",
year = "2012",
month = "1",
day = "1",
language = "English",
pages = "29--35",
booktitle = "Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - The FAUST corpus of adequacy assessments for real-world machine translation output

AU - Pighin, Daniele

AU - Marques, Lluis

AU - Formiga, Lluís

PY - 2012/1/1

Y1 - 2012/1/1

KW - Annotated Corpus

KW - Machine Translation

KW - Quality Assessments

UR - http://www.scopus.com/inward/record.url?scp=84898444889&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84898444889&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84898444889

SP - 29

EP - 35

BT - Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012

PB - European Language Resources Association (ELRA)

ER -