Linguistic measures for automatic machine translation evaluation

Jesús Giménez, Lluís Màrquez

Research output: Contribution to journal › Article

14 Citations (Scopus)

Abstract

Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias in the development cycle which in some cases has been reported to carry very negative consequences. In order to tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information are able to provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, some others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding a substantially improved evaluation quality.
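The measure combination the abstract alludes to can be illustrated with a short sketch. What follows is a minimal, hypothetical Python example, not the authors' implementation: the measure names and scores are invented, and the combination shown is a simple uniform linear combination in which sentence-level scores from lexical, syntactic, and semantic measures, each assumed pre-normalized to [0, 1], are averaged into a single quality estimate.

from typing import Dict

def ulc_score(measure_scores: Dict[str, float]) -> float:
    # Uniform linear combination: the plain average of the
    # normalized scores contributed by each similarity measure.
    return sum(measure_scores.values()) / len(measure_scores)

# Hypothetical scores for one candidate sentence, one measure per
# linguistic level (all values assumed already scaled to [0, 1]).
candidate = {
    "lexical_ngram": 0.42,   # n-gram overlap with the reference
    "syntactic_dep": 0.57,   # dependency-structure overlap
    "semantic_srl": 0.61,    # semantic-role overlap
}

print(round(ulc_score(candidate), 3))  # -> 0.533

Averaging rather than learning weights keeps the combined score robust when individual measures fail (for example, when a parser cannot analyze an ill-formed candidate sentence), which is consistent with the sentence-level behaviour the abstract describes.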

Original language: English
Pages (from-to): 209-240
Number of pages: 32
Journal: Machine Translation
Volume: 24
Issue number: 3-4
DOI: 10.1007/s10590-011-9088-7
Publication status: Published - 1 Dec 2010
Externally published: Yes

Keywords

  • Automatic evaluation methods
  • Combined measures
  • Linguistic analysis
  • Machine translation
  • Semantic similarity
  • Syntactic similarity

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software

Cite this

Linguistic measures for automatic machine translation evaluation. / Giménez, Jesús; Màrquez, Lluís.

In: Machine Translation, Vol. 24, No. 3-4, 01.12.2010, p. 209-240.

Research output: Contribution to journal › Article

@article{78819377b5e0406287787009787e93b2,
title = "Linguistic measures for automatic machine translation evaluation",
abstract = "Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias in the development cycle which in some cases has been reported to carry very negative consequences. In order to tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information are able to provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, some others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding a substantially improved evaluation quality.",
keywords = "Automatic evaluation methods, Combined measures, Linguistic analysis, Machine translation, Semantic similarity, Syntactic similarity",
author = "Jes{\'u}s Gim{\'e}nez and Llu{\'i}s M{\`a}rquez",
year = "2010",
month = "12",
day = "1",
doi = "10.1007/s10590-011-9088-7",
language = "English",
volume = "24",
pages = "209--240",
journal = "Machine Translation",
issn = "0922-6567",
publisher = "Springer Netherlands",
number = "3-4",

}

TY - JOUR

T1 - Linguistic measures for automatic machine translation evaluation

AU - Giménez, Jesús

AU - Màrquez, Lluís

PY - 2010/12/1

Y1 - 2010/12/1

N2 - Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias in the development cycle which in some cases has been reported to carry very negative consequences. In order to tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information are able to provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, some others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding a substantially improved evaluation quality.

AB - Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias in the development cycle which in some cases has been reported to carry very negative consequences. In order to tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information are able to provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, some others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding a substantially improved evaluation quality.

KW - Automatic evaluation methods

KW - Combined measures

KW - Linguistic analysis

KW - Machine translation

KW - Semantic similarity

KW - Syntactic similarity

UR - http://www.scopus.com/inward/record.url?scp=79955829366&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79955829366&partnerID=8YFLogxK

U2 - 10.1007/s10590-011-9088-7

DO - 10.1007/s10590-011-9088-7

M3 - Article

AN - SCOPUS:79955829366

VL - 24

SP - 209

EP - 240

JO - Machine Translation

JF - Machine Translation

SN - 0922-6567

IS - 3-4

ER -