While porting the fairseq transformer and another model from allenai, I wasn't getting the same BLEU scores as reported in the papers. In the end I learned that part of the difference was due to the fact that I was measuring BLEU differently than they did. So when you see a BLEU number in a report, it can mean many different things; for example, you apparently get a higher score if you measure on tokenized outputs rather than detokenized ones.
Please see this paper for many more nuances: Matt Post, "A Call for Clarity in Reporting BLEU Scores" (WMT 2018), https://arxiv.org/abs/1804.08771
In your work and experiments, please try to use sacrebleu for measuring BLEU, as suggested in the paper. That's what our seq2seq eval_run.py uses.
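
For reference, here is a minimal sketch of scoring with sacrebleu (the example sentences are made up; the key point is that sacrebleu takes detokenized text and applies its own canonical tokenization internally, which is what makes scores comparable across papers):

```python
import sacrebleu

# System outputs and references should be *detokenized* strings;
# sacrebleu handles tokenization internally in a standardized way.
hypotheses = ["The cat sat on the mat.", "He read the book."]
# One inner list per set of references (here, a single reference per sentence).
references = [["The cat sat on the mat.", "He was reading the book."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```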
Thank you.