While porting the fairseq transformer and another model from allenai, I wasn't getting the same BLEU scores as reported in the papers. In the end I learned that part of the difference was due to the fact that I was measuring BLEU differently than they did. So when you see a BLEU number in a report, it can mean many different things; for example, you apparently get a higher score if you measure on tokenized outputs rather than detokenized ones.
Please see this paper for many more nuances: Matt Post, "A Call for Clarity in Reporting BLEU Scores" (WMT 2018), https://arxiv.org/abs/1804.08771
In your work and experiments, please try to use sacrebleu for measuring BLEU, as suggested in the paper. That's what our seq2seq eval_run.py uses.
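
For reference, here is a minimal sketch of scoring with sacrebleu (the example sentences are made up; the key point is that sacrebleu takes detokenized text and applies its own canonical tokenization internally, which is what makes scores comparable across papers):

```python
import sacrebleu

# System outputs and references should be *detokenized* strings;
# sacrebleu handles tokenization internally in a standardized way.
hypotheses = ["The cat sat on the mat.", "He read the book."]
# One inner list per set of references (here, a single reference per sentence).
references = [["The cat sat on the mat.", "He was reading the book."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```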
Thank you.