Num_beams: Faster Summarization without Distillation

@patrickvonplaten @valhalla @stas

For many seq2seq models in the hub, num_beams can be set meaningfully lower without hurting metrics.
For the xsum and cnn_dailymail checkpoints, I tried a bunch of different values and settled on the ones listed below (models whose default is already good are not listed). The defaults are 8 for all pegasus, 6 for bart*xsum and 4 for bart*cnn. It’s not clear whether to change defaults from the published parameters (it would be nice to save compute for pipelines and the inference API, though), so I figured I’d just post this in case people want faster inference. The speedups are substantial: between 20% and 100%. Lowering num_beams tends to be easier on cnn_dailymail than on xsum.

google/pegasus-cnn_dailymail: 4
sshleifer/distill-pegasus-cnn-16-4: 4
sshleifer/pegasus-cnn-ft-v2: 4
sshleifer/distilbart-cnn-12-3: 3
sshleifer/distilbart-cnn-12-6: 2
sshleifer/distilbart-cnn-6-6: 2
sshleifer/distill-pegasus-xsum-16-4: 4
sshleifer/distill-pegasus-xsum-12-12: 4
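If you want to try these values without editing any config, here is a minimal sketch of passing them at generation time. The `SUGGESTED_NUM_BEAMS` dict and the `suggested_generate_kwargs` and `summarize` helpers are names made up for this example (they are not part of transformers); the only library calls assumed are `AutoTokenizer`, `AutoModelForSeq2SeqLM`, `generate(num_beams=...)` and `batch_decode`.

```python
# Values copied from the list above; models not listed keep their default.
SUGGESTED_NUM_BEAMS = {
    "google/pegasus-cnn_dailymail": 4,
    "sshleifer/distill-pegasus-cnn-16-4": 4,
    "sshleifer/pegasus-cnn-ft-v2": 4,
    "sshleifer/distilbart-cnn-12-3": 3,
    "sshleifer/distilbart-cnn-12-6": 2,
    "sshleifer/distilbart-cnn-6-6": 2,
    "sshleifer/distill-pegasus-xsum-16-4": 4,
    "sshleifer/distill-pegasus-xsum-12-12": 4,
}


def suggested_generate_kwargs(model_name, **overrides):
    """Hypothetical helper: build kwargs for generate() for a given checkpoint."""
    kwargs = {}
    if model_name in SUGGESTED_NUM_BEAMS:
        kwargs["num_beams"] = SUGGESTED_NUM_BEAMS[model_name]
    # Explicit overrides from the caller always win.
    kwargs.update(overrides)
    return kwargs


def summarize(text, model_name):
    """Hypothetical helper: summarize one document with the suggested num_beams."""
    # Imported lazily so the lookup helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer([text], return_tensors="pt", truncation=True)
    # A num_beams passed here takes precedence over the value in the model config.
    output_ids = model.generate(**inputs, **suggested_generate_kwargs(model_name))
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `summarize(article, "sshleifer/distilbart-cnn-12-6")` would then decode with 2 beams instead of the checkpoint's default, while an unlisted model falls through to whatever its config specifies.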

Here are some rouge2 vs num_beams plots for different models
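For anyone who wants to produce similar curves for their own model or data, this is a rough sketch of such a sweep, not the script I used. `sweep_num_beams` and `smallest_within_tolerance` are hypothetical helpers, and `generate_fn` and `score_fn` are placeholders you would supply yourself (for example, a function that runs `model.generate` over a validation set and one that computes rouge2 against the references).

```python
import time


def sweep_num_beams(generate_fn, score_fn, beam_values):
    """Time generation and score the output for each candidate num_beams.

    generate_fn(num_beams) -> summaries and score_fn(summaries) -> float
    are caller-supplied placeholders.
    """
    results = []
    for num_beams in beam_values:
        start = time.perf_counter()
        summaries = generate_fn(num_beams)
        elapsed = time.perf_counter() - start
        results.append(
            {"num_beams": num_beams, "seconds": elapsed, "score": score_fn(summaries)}
        )
    return results


def smallest_within_tolerance(results, tolerance):
    """Return the smallest num_beams whose score is within `tolerance` of the best."""
    best = max(r["score"] for r in results)
    candidates = [r for r in results if best - r["score"] <= tolerance]
    return min(c["num_beams"] for c in candidates)
```

Plotting `score` against `num_beams` from the returned list gives the kind of curve shown here, and the `seconds` field gives a first estimate of the speedup. The tolerance is a judgment call that depends on how much metric loss you are willing to trade for speed.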



Another note:

facebook/bart-large-xsum: prefix=" " hurts rouge2 by .02 and should be removed. It has no impact on facebook/bart-large-cnn.


Awesome sharing, @sshleifer!

Perhaps let’s add these notes to Otherwise it’d be difficult to remember that this is on the forums - or perhaps create with various such performance notes - so README focuses on functionality, and the latter for tips and tricks.