Significance tests of automatic machine translation evaluation metrics

Ying Zhang, Stephan Vogel

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
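The bootstrapping approach the abstract refers to can be illustrated with a short sketch. The following Python snippet is a minimal sketch, not the authors' code; `corpus_metric` is a hypothetical stand-in for any corpus-level scorer such as BLEU or the NIST metric. It resamples test-set sentences with replacement to estimate a confidence interval for a metric score, the basic ingredient of bootstrap-based significance testing.

    import random

    def bootstrap_ci(hyps, refs, corpus_metric, n_samples=1000, alpha=0.05):
        """Estimate a (1 - alpha) confidence interval for a corpus-level
        MT metric by resampling test sentences with replacement."""
        n = len(hyps)
        scores = []
        for _ in range(n_samples):
            # Draw one bootstrap replicate: n sentence indices, with replacement.
            idx = [random.randrange(n) for _ in range(n)]
            scores.append(corpus_metric([hyps[i] for i in idx],
                                        [refs[i] for i in idx]))
        scores.sort()
        # Percentile bounds of the empirical bootstrap distribution.
        lower = scores[int((alpha / 2) * n_samples)]
        upper = scores[int((1 - alpha / 2) * n_samples) - 1]
        return lower, upper

In one common variant for comparing two systems, the same resampled indices are applied to both systems' outputs (a paired bootstrap); if the resulting interval on the score difference excludes zero, the difference is judged significant at the chosen level.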

Original language: English
Pages (from-to): 51-65
Number of pages: 15
Journal: Machine Translation
Volume: 24
Issue number: 1
DOI: 10.1007/s10590-010-9073-6
Publication status: Published - 1 Mar 2010
Externally published: Yes

Keywords

  • Bootstrap
  • Confidence interval
  • Evaluation suite construction
  • Machine translation evaluation
  • Significance test

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software

Cite this

Significance tests of automatic machine translation evaluation metrics. / Zhang, Ying; Vogel, Stephan.

In: Machine Translation, Vol. 24, No. 1, 01.03.2010, p. 51-65.

Research output: Contribution to journal › Article

@article{394b2c21f2d44162973341778da20504,
title = "Significance tests of automatic machine translation evaluation metrics",
abstract = "Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance testdriven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.",
keywords = "Bootstrap, Confidence interval, Evaluation suite construction, Machine translation evaluation, Significance test",
author = "Ying Zhang and Stephan Vogel",
year = "2010",
month = "3",
day = "1",
doi = "10.1007/s10590-010-9073-6",
language = "English",
volume = "24",
pages = "51--65",
journal = "Machine Translation",
issn = "0922-6567",
publisher = "Springer Netherlands",
number = "1",

}

TY - JOUR

T1 - Significance tests of automatic machine translation evaluation metrics

AU - Zhang, Ying

AU - Vogel, Stephan

PY - 2010/3/1

Y1 - 2010/3/1

N2 - Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.

AB - Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.

KW - Bootstrap

KW - Confidence interval

KW - Evaluation suite construction

KW - Machine translation evaluation

KW - Significance test

UR - http://www.scopus.com/inward/record.url?scp=78650039224&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650039224&partnerID=8YFLogxK

U2 - 10.1007/s10590-010-9073-6

DO - 10.1007/s10590-010-9073-6

M3 - Article

AN - SCOPUS:78650039224

VL - 24

SP - 51

EP - 65

JO - Machine Translation

JF - Machine Translation

SN - 0922-6567

IS - 1

ER -