Interpreting BLEU/NIST scores: How much improvement do we need to have a better system?

Ying Zhang, Stephan Vogel, Alex Waibel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

79 Citations (Scopus)

Abstract

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.
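
The bootstrap procedure the abstract describes can be sketched in a few lines: resample the test set's sentences with replacement, rescore each resample with the corpus-level metric, and read the confidence interval off the sorted scores. Below is a minimal Python sketch of that idea; the names (`bootstrap_ci`, `unigram_precision`) and defaults are illustrative assumptions, not the authors' code, and `unigram_precision` is only a toy stand-in where a real BLEU/NIST implementation would be plugged in.

```python
import random

def unigram_precision(hyps, refs):
    """Toy corpus-level metric: fraction of hypothesis tokens that also
    appear in the corresponding reference. A stand-in for BLEU/NIST."""
    matched = total = 0
    for hyp, ref in zip(hyps, refs):
        ref_tokens = set(ref.split())
        matched += sum(tok in ref_tokens for tok in hyp.split())
        total += len(hyp.split())
    return matched / total if total else 0.0

def bootstrap_ci(hyps, refs, metric=unigram_precision,
                 n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a corpus-level score:
    resample test sentences with replacement and rescore each resample."""
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([hyps[i] for i in idx],
                             [refs[i] for i in idx]))
    scores.sort()
    lower = scores[int(n_resamples * alpha / 2)]
    upper = scores[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper
```

To compare two systems, the same resampled sentence indices can be scored under both and the interval computed over the score differences; if that interval excludes zero, the observed improvement is unlikely to be a sampling artifact.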

Original language: English
Title of host publication: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004
Publisher: European Language Resources Association (ELRA)
Pages: 2051-2054
Number of pages: 4
ISBN (Electronic): 2951740816, 9782951740815
Publication status: Published - 1 Jan 2004
Externally published: Yes
Event: 4th International Conference on Language Resources and Evaluation, LREC 2004 - Lisbon, Portugal
Duration: 26 May 2004 to 28 May 2004


ASJC Scopus subject areas

  • Library and Information Sciences
  • Education
  • Language and Linguistics
  • Linguistics and Language

Cite this

Zhang, Y., Vogel, S., & Waibel, A. (2004). Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004 (pp. 2051-2054). European Language Resources Association (ELRA).