Interpreting BLEU/NIST scores: How much improvement do we need to have a better system?

Ying Zhang, Stephan Vogel, Alex Waibel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

78 Citations (Scopus)

Abstract

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.
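
The bootstrap procedure the abstract refers to can be sketched in a few lines: resample the test-set sentences with replacement many times, recompute the corpus-level score on each resample, and read a confidence interval off the resulting score distribution; a paired variant of the same resampling addresses the question in the title, i.e. whether one system's improvement over another is significant. Below is a minimal Python sketch of both steps, assuming tokenized references and hypotheses and using NLTK's corpus_bleu as a stand-in scorer (the paper used the original BLEU/NIST tooling; the helper names here are illustrative, not from the paper):

import random
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bootstrap_bleu_ci(references, hypotheses, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for corpus-level BLEU.

    references: one list of tokenized reference translations per test sentence,
                e.g. [[["the", "cat", "sat"]], ...]
    hypotheses: tokenized system outputs, aligned with references.
    """
    rng = random.Random(seed)
    smooth = SmoothingFunction().method1  # avoid zero scores on small resamples
    n = len(hypotheses)
    scores = []
    for _ in range(n_resamples):
        # Resample test sentences with replacement, keeping refs/hyps paired.
        idx = [rng.randrange(n) for _ in range(n)]
        refs = [references[i] for i in idx]
        hyps = [hypotheses[i] for i in idx]
        scores.append(corpus_bleu(refs, hyps, smoothing_function=smooth))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def paired_bootstrap_test(references, hyps_a, hyps_b, n_resamples=1000, seed=0):
    """Fraction of paired resamples on which system A outscores system B.

    A value near 1.0 (or 0.0) suggests the BLEU difference is significant;
    a value near 0.5 suggests the systems are statistically indistinguishable.
    """
    rng = random.Random(seed)
    smooth = SmoothingFunction().method1
    n = len(hyps_a)
    wins = 0
    for _ in range(n_resamples):
        # Use the same resampled sentence indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        refs = [references[i] for i in idx]
        score_a = corpus_bleu(refs, [hyps_a[i] for i in idx], smoothing_function=smooth)
        score_b = corpus_bleu(refs, [hyps_b[i] for i in idx], smoothing_function=smooth)
        if score_a > score_b:
            wins += 1
    return wins / n_resamples

For present-day evaluations, tools such as sacreBLEU ship paired bootstrap resampling along these lines, so the test need not be re-implemented by hand.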

Original language: English
Title of host publication: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004
Publisher: European Language Resources Association (ELRA)
Pages: 2051-2054
Number of pages: 4
ISBN (Electronic): 2951740816, 9782951740815
Publication status: Published - 1 Jan 2004
Externally published: Yes
Event: 4th International Conference on Language Resources and Evaluation, LREC 2004 - Lisbon, Portugal
Duration: 26 May 2004 → 28 May 2004

Other

Other: 4th International Conference on Language Resources and Evaluation, LREC 2004
Country: Portugal
City: Lisbon
Period: 26/5/04 → 28/5/04

Fingerprint

  • Confidence Interval
  • Bootstrapping
  • Evaluation
  • Machine Translation
  • Machine Translation System

ASJC Scopus subject areas

  • Library and Information Sciences
  • Education
  • Language and Linguistics
  • Linguistics and Language

Cite this

APA

Zhang, Y., Vogel, S., & Waibel, A. (2004). Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004 (pp. 2051-2054). European Language Resources Association (ELRA).

Standard

Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? / Zhang, Ying; Vogel, Stephan; Waibel, Alex.
Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004. European Language Resources Association (ELRA), 2004. p. 2051-2054.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Harvard

Zhang, Y, Vogel, S & Waibel, A 2004, Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? in Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004. European Language Resources Association (ELRA), pp. 2051-2054, 4th International Conference on Language Resources and Evaluation, LREC 2004, Lisbon, Portugal, 26/5/04.

Vancouver

Zhang Y, Vogel S, Waibel A. Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004. European Language Resources Association (ELRA). 2004. p. 2051-2054.

Author

Zhang, Ying; Vogel, Stephan; Waibel, Alex. / Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004. European Language Resources Association (ELRA), 2004. pp. 2051-2054.

BIBTEX
@inproceedings{8394523eea784ca0886b0a739b2728b2,
title = "Interpreting BLEU/NIST scores: How much improvement do we need to have a better system?",
abstract = "Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.",
author = "Ying Zhang and Stephan Vogel and Alex Waibel",
year = "2004",
month = "1",
day = "1",
language = "English",
pages = "2051--2054",
booktitle = "Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004",
publisher = "European Language Resources Association (ELRA)",

}

RIS

TY - GEN

T1 - Interpreting BLEU/NIST scores

T2 - How much improvement do we need to have a better system?

AU - Zhang, Ying

AU - Vogel, Stephan

AU - Waibel, Alex

PY - 2004/1/1

Y1 - 2004/1/1

N2 - Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.

AB - Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.

UR - http://www.scopus.com/inward/record.url?scp=84921700653&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921700653&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84921700653

SP - 2051

EP - 2054

BT - Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004

PB - European Language Resources Association (ELRA)

ER -