Cannot reproduce the results

Hi, I'm trying to reproduce the BART summarization results, but my numbers are not comparable to the claimed performance. I tried sshleifer/distilbart-cnn-12-6 and facebook/bart-large-cnn and hit the same problem with both.

My generation process is a slightly modified version of the released summarization pipeline:

python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/test.source $OUTPUT_FILE \
    --reference_path $DATA_DIR/test.target \
    --task summarization \
    --device cuda \
    --fp16 \
    --bs 32

My performance without post-processing:

1 ROUGE-1 Average_R: 0.48286 (95%-conf.int. 0.48036 - 0.48554)
1 ROUGE-1 Average_P: 0.33581 (95%-conf.int. 0.33356 - 0.33802)
1 ROUGE-1 Average_F: 0.38536 (95%-conf.int. 0.38338 - 0.38737)
---------------------------------------------
1 ROUGE-2 Average_R: 0.20405 (95%-conf.int. 0.20148 - 0.20648)
1 ROUGE-2 Average_P: 0.14260 (95%-conf.int. 0.14067 - 0.14449)
1 ROUGE-2 Average_F: 0.16314 (95%-conf.int. 0.16108 - 0.16517)
---------------------------------------------
1 ROUGE-L Average_R: 0.40419 (95%-conf.int. 0.40174 - 0.40665)
1 ROUGE-L Average_P: 0.28191 (95%-conf.int. 0.27984 - 0.28396)
1 ROUGE-L Average_F: 0.32309 (95%-conf.int. 0.32111 - 0.32509)

My performance with post-processing (from ProphetNet):

1 ROUGE-1 Average_R: 0.49758 (95%-conf.int. 0.49505 - 0.50028)
1 ROUGE-1 Average_P: 0.35663 (95%-conf.int. 0.35421 - 0.35889)
1 ROUGE-1 Average_F: 0.40406 (95%-conf.int. 0.40200 - 0.40607)
---------------------------------------------
1 ROUGE-2 Average_R: 0.21882 (95%-conf.int. 0.21622 - 0.22125)
1 ROUGE-2 Average_P: 0.15750 (95%-conf.int. 0.15543 - 0.15947)
1 ROUGE-2 Average_F: 0.17794 (95%-conf.int. 0.17576 - 0.17998)
---------------------------------------------
1 ROUGE-L Average_R: 0.41627 (95%-conf.int. 0.41375 - 0.41881)
1 ROUGE-L Average_P: 0.29928 (95%-conf.int. 0.29712 - 0.30132)
1 ROUGE-L Average_F: 0.33860 (95%-conf.int. 0.33658 - 0.34056)

The expected performance for sshleifer/distilbart-cnn-12-6 is ?/21.26/30.59, but I can only reach 40.41/17.79/33.86. Is the trick the post-processing, or is there something else I need to do to reach the expected numbers?
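For reference, here is roughly what I mean by post-processing: splitting each summary into one sentence per line before scoring, so ROUGE-L is computed sentence by sentence rather than over the whole summary as a single sequence. This is just a sketch using nltk and the rouge_score package as stand-ins, not the actual ProphetNet script:

```python
# Sketch of the post-processing idea: one sentence per line, then score with
# rougeLsum so ROUGE-L is computed per sentence. nltk / rouge_score are my
# own choices here, not the exact ProphetNet tooling.
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt", quiet=True)

def sent_split(text: str) -> str:
    # newline-separated sentences are what rougeLsum expects
    return "\n".join(nltk.sent_tokenize(text.strip()))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

reference = "A cat sat on the mat. Afterwards it slept."
prediction = "The cat sat on the mat. It then fell asleep."
scores = scorer.score(sent_split(reference), sent_split(prediction))  # (target, prediction)
print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```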

Thank you!

For anyone who comes across this post: the problem was solved by using a larger batch size. The results above were with batch size 32 on a GeForce 2080 Ti; today I switched to a Tesla V100 with batch size 128, and the result is pretty close to the expected numbers:

1 ROUGE-1 Average_R: 0.53399 (95%-conf.int. 0.53146 - 0.53669)
1 ROUGE-1 Average_P: 0.39205 (95%-conf.int. 0.38963 - 0.39451)
1 ROUGE-1 Average_F: 0.44179 (95%-conf.int. 0.43972 - 0.44408)
---------------------------------------------
1 ROUGE-2 Average_R: 0.25584 (95%-conf.int. 0.25312 - 0.25867)
1 ROUGE-2 Average_P: 0.18821 (95%-conf.int. 0.18605 - 0.19056)
1 ROUGE-2 Average_F: 0.21172 (95%-conf.int. 0.20940 - 0.21420)
---------------------------------------------
1 ROUGE-L Average_R: 0.44976 (95%-conf.int. 0.44719 - 0.45244)
1 ROUGE-L Average_P: 0.33090 (95%-conf.int. 0.32864 - 0.33330)
1 ROUGE-L Average_F: 0.37251 (95%-conf.int. 0.37040 - 0.37465)

A little confused: why would batch size affect the result?


Different batch size = different average gradients per batch. Depending on the total size of the data and the training time, this can have a big effect on final performance. Gradient accumulation should help with that.
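Something like this (a generic PyTorch sketch of gradient accumulation with toy data, not the actual trainer code):

```python
import torch
from torch import nn

# Toy model/data just so the sketch runs end to end.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4  # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads match one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

The accumulated gradient is then roughly equivalent to what you would get from the larger batch directly, so you can mimic a big effective batch on a smaller GPU.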


cc @sshleifer

This post might also help

@valhalla I think that is a totally separate issue (and already fixed in calculate_rouge_score on master).