Not all BLEU scores were created equal

While porting the fairseq transformer and another model from allenai, I wasn’t getting the same BLEU scores as those reported in the papers. In the end I learned that part of the difference came from the fact that I was measuring BLEU differently than the authors did. So when you see a BLEU number in a report, it could mean many different things. For example, you apparently get a higher score if you measure tokenized outputs.
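To make the tokenization point concrete, here is a rough sketch in plain Python. The `toy_bleu` function and the regex tokenizer are made up for illustration; this is a simplified single-sentence, single-reference BLEU without smoothing, not what sacrebleu or any paper actually computes. It only shows that the same hypothesis/reference pair can score quite differently depending on how the text is split into tokens.

```python
import math
import re
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(hyp_tokens, ref_tokens, max_n=4):
    """Simplified BLEU (hypothetical helper): one sentence, one reference, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hyp_tokens, n)
        ref = ngram_counts(ref_tokens, n)
        total = sum(hyp.values())
        # Clipped matches: each n-gram counts at most as often as it occurs in the reference.
        matched = sum((hyp & ref).values())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "Hello, world! This is a test."
hypothesis = "Hello, world! This is the test."

# Variant 1: split on whitespace only, so punctuation stays glued to words.
raw_score = toy_bleu(hypothesis.split(), reference.split())

# Variant 2: an illustrative tokenizer that splits punctuation into separate tokens.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

tok_score = toy_bleu(tokenize(hypothesis), tokenize(reference))

print(f"whitespace split:  {raw_score:.2f}")
print(f"punctuation split: {tok_score:.2f}")
```

In this toy case the punctuation-split variant comes out higher, because the extra punctuation tokens add more matching n-grams. Real evaluations differ in many more ways than this, but the same text pair already gives two different "BLEU" numbers here.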

Please see this paper for many more nuances:

In your work and experiments, please try to use sacrebleu for measuring BLEU, as suggested in the paper. That’s what our seq2seq uses.

Thank you.