Distilbart paper

Good evening,
Is there a paper about distilbart? I need documentation for my master thesis and I couldn’t find any. Thanks for your help!


Hi @Hildweig, there is no paper for DistilBART; the idea came from @sshleifer’s great mind :wink:
You can find the details of the distillation process here

For the CNN models, the distilled model is created by copying alternating layers from bart-large-cnn. There is no teacher distillation, i.e. you just copy layers from the teacher model and then fine-tune the student model in the standard way.
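The layer-copying step can be sketched like this (a minimal illustration with toy layers; `copy_student_layers` is a made-up name, not the actual helper in the repo):

```python
import torch.nn as nn

def copy_student_layers(teacher_layers, layers_to_copy):
    """Build a student layer stack from the chosen teacher layers."""
    return nn.ModuleList(teacher_layers[i] for i in layers_to_copy)

# Stand-in for the 12 decoder layers of bart-large-cnn
teacher = nn.ModuleList(nn.Linear(8, 8) for _ in range(12))
student = copy_student_layers(teacher, [0, 2, 4, 7, 9, 11])
# The student starts with the teacher's weights at those positions
# and is then fine-tuned on CNN/DailyMail in the usual seq2seq way.
```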

And for XSUM it uses a combination of DistilBERT’s ce_loss and the hidden-states MSE loss used in the TinyBERT paper.
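Roughly, that combined objective looks like this (a sketch only: the function name and alpha weights are illustrative assumptions, not the exact implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, temperature=2.0,
                      alpha_data=1.0, alpha_ce=0.8, alpha_mse=3.0):
    # Hard-label cross entropy on the data
    data_loss = F.cross_entropy(student_logits, labels)
    # Soft-target loss (KL between tempered distributions), as in DistilBERT
    ce_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hidden-state MSE, as in TinyBERT (assumes matched layer pairs)
    mse_loss = F.mse_loss(student_hidden, teacher_hidden)
    return alpha_data * data_loss + alpha_ce * ce_loss + alpha_mse * mse_loss

# Toy shapes: batch of 2, vocab of 5, hidden dim 4
s_logits, t_logits = torch.randn(2, 5), torch.randn(2, 5)
s_hid, t_hid = torch.randn(2, 4), torch.randn(2, 4)
loss = distillation_loss(s_logits, t_logits, s_hid, t_hid,
                         labels=torch.tensor([1, 3]))
```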
DistilBERT paper
tinybert paper


Thanks a lot! So you’re saying that we can still fine-tune distilbart-cnn-12-6 on CNN/DM?

distilbart-cnn-12-6 is already fine-tuned on CNN/DM. The way it is done: first a student model is created as described above, and then it’s fine-tuned on CNN/DM.

You can create your own (un-fine-tuned) student following the readme and then fine-tune it on CNN/DM. Here are the already available un-fine-tuned students


Thanks a lot!

I’ll add that although there’s nothing out yet, there will likely be a paper at some point.


For the alternating layers, did he choose, for example, odd layers/even layers? Or how does it work?

Pinging @sshleifer


thx for the ping
For 12-6 it’s [0, 2, 4, 7, 9, 11], because I arbitrarily decided I wanted to keep the first and last layers in.
For 12-3 it’s [0, 6, 11]; for 12-1 it’s [0]. 12-4 ([0, 4, 8, 11]) was working way better than 12-3 for mbart-large-enro in early experiments.
Note: A full copy of the teacher would be 12-12.

Code: https://github.com/huggingface/transformers/blob/09a2f40684f77e62d0fd8485fe9d2d610390453f/examples/seq2seq/distillation.py#L403

            layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
                1: [0],
                2: [0, 6],
                3: [0, 6, 11],
                4: [0, 4, 8, 11],
                6: [0, 2, 4, 7, 9, 11],
                9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
                12: all_layers,
            }

Also, what training algorithm was used to train /finetune ? And the learning rate value for which it was fine-tuned on cnn dailymail?

Hi @Hildweig, it was fine-tuned the way you fine-tune/train any encoder-decoder Transformer model.
On a high level, the encoder takes the input sequence and creates a hidden representation of it.
The decoder then receives the encoder representation and is trained to generate the output sequence auto-regressively using teacher forcing.
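Concretely, teacher forcing means the decoder’s input is the gold target sequence shifted right, so at every position it conditions on the true previous token. A minimal sketch (`shift_tokens_right` mirrors the helper in transformers’ BART code; 2 as the decoder start token id is an assumption for illustration):

```python
import torch

def shift_tokens_right(labels, decoder_start_token_id=2):
    """Prepend the start token and drop the last label token."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1]
    shifted[:, 0] = decoder_start_token_id
    return shifted

labels = torch.tensor([[10, 11, 12, 13]])
decoder_input_ids = shift_tokens_right(labels)
# At each position the decoder sees the gold previous token and is
# trained (with cross entropy) to predict the corresponding label.
```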

You may have already read this, but The Illustrated Transformer by Jay Alammar explains it really well.

All the training hyperparameters are in this file

@sshleifer is there a reason that the R-L scores in the sheet (~30) differ a lot from the R-L scores reported by Facebook in the paper (~40)? (Specifically on the CNN/DM dataset.)

It comes down to the rouge scoring package; see: https://github.com/huggingface/transformers/issues/6808

If anyone is still reading this:

The paper was released on 28.09.2020:
Pre-trained Summarization Distillation


@sshleifer Hi, could you please share the hyperparameters that you used for distilbart-xsum-12-3? I am trying to reproduce the result with KD loss but getting a lower result (21.18 Rouge-2 vs. 21.63 reported in the paper). Thank you!

They should be somewhere in the research_projects/ directory, but things have moved around.
If you show me a command, I can check for glaring errors.
Before you retrain, make sure your max_length and length_penalty (they may have been renamed) match my xsum-12-3 config on the model hub.

This is the command that I am using:

train_distilbart_seq2seq --output_dir /tmp/path/to/dir/ \
                         --model_name facebook/bart-large-xsum \
                         --tokenizer_name facebook/bart-large-xsum \
                         --task summarization \
                         --dataset_name xsum \
                         --learning_rate 3e-4 \
                         --use_kd_loss True \
                         --alpha_data 1.0 \
                         --alpha_logits 0.8 \
                         --alpha_hidden 3.0 \
                         --max_source_length 1024 \
                         --max_target_length 256 \
                         --do_train \
                         --do_eval \
                         --do_predict \
                         --num_beams 6 \
                         --num_train_epochs 5 \
                         --evaluation_strategy steps \
                         --save_total_limit 5 \
                         --load_best_model_at_end

(The learning rate is taken from the [example script](https://github.com/huggingface/transformers/tree/master/examples/research_projects/seq2seq-distillation).)

I did not specify min_length, max_length, and length_penalty as I let them take the values from the teacher model (min_length=11, max_length=62, which match the config in the model hub, I will need to double-check length_penalty). Other than that, please let me know if there’s anything wrong with my command. Thank you!

Command looks fine, but you may want a different length_penalty than the teacher’s.