Significance tests of automatic machine translation evaluation metrics

Ying Zhang, Stephan Vogel

Research output: Contribution to journal › Article

5 Citations (Scopus)


Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance-test-driven comparison of n-gram-based automatic MT evaluation metrics. The statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
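The bootstrapping approach described in the abstract can be illustrated with a minimal sketch: resample the test-set sentences with replacement, recompute the corpus-level metric on each resample, and read a confidence interval off the percentiles of the resulting distribution. The function names, the per-sentence score representation, and the use of a simple mean as the aggregate metric are illustrative assumptions, not the paper's actual implementation.

```python
import random


def bootstrap_ci(sentence_scores, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a corpus-level MT metric.

    sentence_scores: per-sentence statistics for the test set (illustrative:
        here just one number per sentence; a real BLEU bootstrap would carry
        n-gram match counts and lengths so the metric can be recomputed).
    metric: function aggregating a list of per-sentence statistics into a
        single corpus-level score.
    """
    rng = random.Random(seed)
    n = len(sentence_scores)
    resampled = []
    for _ in range(n_resamples):
        # Resample the test set with replacement, sentence by sentence.
        sample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        resampled.append(metric(sample))
    resampled.sort()
    # Read the (alpha/2, 1 - alpha/2) percentiles off the sorted scores.
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy usage: mean of per-sentence scores as the "metric".
scores = [0.2, 0.5, 0.9, 0.4, 0.7, 0.6, 0.3, 0.8]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(scores, mean)
```

With per-sentence sufficient statistics instead of a single number, the same loop supports non-decomposable metrics such as corpus-level BLEU, since the metric is recomputed from scratch on each resample rather than averaged.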

Original language: English
Pages (from-to): 51-65
Number of pages: 15
Journal: Machine Translation
Issue number: 1
Publication status: Published - 1 Mar 2010


Keywords

  • Bootstrap
  • Confidence interval
  • Evaluation suite construction
  • Machine translation evaluation
  • Significance test

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence
