Cannot reproduce the results

Hi, I'm trying to reproduce the BART summarization results, but my numbers are not comparable to the claimed performance. I tried sshleifer/distilbart-cnn-12-6 and facebook/bart-large-cnn and hit the same problem with both.

My generation process is a slightly modified version of the released summarization pipeline:

python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/test.source $OUTPUT_FILE \
    --reference_path $DATA_DIR/test.target \
    --task summarization \
    --device cuda \
    --fp16 \
    --bs 32

My performance without post-processing:

1 ROUGE-1 Average_R: 0.48286 (95%-conf.int. 0.48036 - 0.48554)
1 ROUGE-1 Average_P: 0.33581 (95%-conf.int. 0.33356 - 0.33802)
1 ROUGE-1 Average_F: 0.38536 (95%-conf.int. 0.38338 - 0.38737)
---------------------------------------------
1 ROUGE-2 Average_R: 0.20405 (95%-conf.int. 0.20148 - 0.20648)
1 ROUGE-2 Average_P: 0.14260 (95%-conf.int. 0.14067 - 0.14449)
1 ROUGE-2 Average_F: 0.16314 (95%-conf.int. 0.16108 - 0.16517)
---------------------------------------------
1 ROUGE-L Average_R: 0.40419 (95%-conf.int. 0.40174 - 0.40665)
1 ROUGE-L Average_P: 0.28191 (95%-conf.int. 0.27984 - 0.28396)
1 ROUGE-L Average_F: 0.32309 (95%-conf.int. 0.32111 - 0.32509)

My performance with post-processing (from ProphetNet):

1 ROUGE-1 Average_R: 0.49758 (95%-conf.int. 0.49505 - 0.50028)
1 ROUGE-1 Average_P: 0.35663 (95%-conf.int. 0.35421 - 0.35889)
1 ROUGE-1 Average_F: 0.40406 (95%-conf.int. 0.40200 - 0.40607)
---------------------------------------------
1 ROUGE-2 Average_R: 0.21882 (95%-conf.int. 0.21622 - 0.22125)
1 ROUGE-2 Average_P: 0.15750 (95%-conf.int. 0.15543 - 0.15947)
1 ROUGE-2 Average_F: 0.17794 (95%-conf.int. 0.17576 - 0.17998)
---------------------------------------------
1 ROUGE-L Average_R: 0.41627 (95%-conf.int. 0.41375 - 0.41881)
1 ROUGE-L Average_P: 0.29928 (95%-conf.int. 0.29712 - 0.30132)
1 ROUGE-L Average_F: 0.33860 (95%-conf.int. 0.33658 - 0.34056)

The expected performance for sshleifer/distilbart-cnn-12-6 is ?/21.26/30.59, but I can only reach 40.41/17.79/33.86. Is the trick the post-processing, or is there something else I need to do to reach the expected numbers?
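For reference, here is roughly what I mean by post-processing: splitting each summary into one sentence per line before scoring, so ROUGE-L is computed sentence by sentence rather than over the whole summary as a single sequence. This is just a sketch using nltk and the rouge_score package as stand-ins, not the actual ProphetNet script:

```python
# Sketch of the post-processing idea: one sentence per line, then score with
# rougeLsum so ROUGE-L is computed per sentence. nltk / rouge_score are my
# own choices here, not the exact ProphetNet tooling.
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt", quiet=True)

def sent_split(text: str) -> str:
    # newline-separated sentences are what rougeLsum expects
    return "\n".join(nltk.sent_tokenize(text.strip()))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

reference = "A cat sat on the mat. Afterwards it slept."
prediction = "The cat sat on the mat. It then fell asleep."
scores = scorer.score(sent_split(reference), sent_split(prediction))  # (target, prediction)
print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```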

Thank you!

For anyone who comes across this post: the problem was solved by using a larger batch size. The results above were with batch size 32 on a GeForce 2080 Ti; today I switched to a Tesla V100 with batch size 128, and the result is pretty close to the expected numbers:

1 ROUGE-1 Average_R: 0.53399 (95%-conf.int. 0.53146 - 0.53669)
1 ROUGE-1 Average_P: 0.39205 (95%-conf.int. 0.38963 - 0.39451)
1 ROUGE-1 Average_F: 0.44179 (95%-conf.int. 0.43972 - 0.44408)
---------------------------------------------
1 ROUGE-2 Average_R: 0.25584 (95%-conf.int. 0.25312 - 0.25867)
1 ROUGE-2 Average_P: 0.18821 (95%-conf.int. 0.18605 - 0.19056)
1 ROUGE-2 Average_F: 0.21172 (95%-conf.int. 0.20940 - 0.21420)
---------------------------------------------
1 ROUGE-L Average_R: 0.44976 (95%-conf.int. 0.44719 - 0.45244)
1 ROUGE-L Average_P: 0.33090 (95%-conf.int. 0.32864 - 0.33330)
1 ROUGE-L Average_F: 0.37251 (95%-conf.int. 0.37040 - 0.37465)

A little confused: why would batch size affect the result?


Different batch size = different average gradients per batch. Depending on the total size of the data and the training time, this can have a big effect on final performance. Gradient accumulation should help with that.
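Something like this (a generic PyTorch sketch of gradient accumulation with toy data, not the actual trainer code):

```python
import torch
from torch import nn

# Toy model/data just so the sketch runs end to end.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4  # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads match one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

The accumulated gradient is then roughly equivalent to what you would get from the larger batch directly, so you can mimic a big effective batch on a smaller GPU.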


cc @sshleifer

This post might also help

@valhalla I think that is a totally separate issue (and already fixed in calculate_rouge_score on master).