Distilbart paper

Good evening,
Is there a paper about distilbart? I need documentation for my master thesis and I couldn’t find any. Thanks for your help!

Hi @Hildweig, There is no paper for distilbart, the idea of distilbart came from @sshleifer’s great mind :wink:
You can find the details of the distillation process here.

For the CNN models, the distilled model is created by copying alternating layers from bart-large-cnn. There is no teacher distillation, i.e. you just copy layers from the teacher model and then fine-tune the student model in the standard way.
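A minimal sketch of that layer-copying step, assuming a generic `nn.ModuleList` of teacher layers (the function name and the choice of indices here are illustrative, not the actual script):

```python
import torch.nn as nn

def copy_layers(teacher_layers: nn.ModuleList, layers_to_copy) -> nn.ModuleList:
    """Build a student's layer stack by keeping a subset of the teacher's
    layers (with their trained weights), so the student starts from a good
    initialization before standard fine-tuning."""
    return nn.ModuleList([teacher_layers[i] for i in layers_to_copy])
```

The student then replaces its decoder (or encoder) stack with the returned list and is fine-tuned normally, with no distillation loss involved.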

And for XSUM it uses a combination of DistilBERT’s ce_loss and the hidden-states MSE loss used in the TinyBERT paper.
DistilBERT paper
TinyBERT paper
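A hedged sketch of what such a combined objective could look like — soft cross-entropy on the logits (DistilBERT-style) plus MSE between matched hidden states (TinyBERT-style). The function name, loss weights, and temperature below are illustrative assumptions, not the exact values used:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits,
                 student_hidden, teacher_hidden,
                 temperature=2.0, alpha_ce=0.8, alpha_hid=0.2):
    """Combine a soft-target cross-entropy on logits with an MSE
    between student hidden states and matched teacher hidden states."""
    # Soft cross-entropy: KL divergence between temperature-softened
    # student and teacher distributions (scaled by T^2, as in DistilBERT).
    ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # MSE between the student's hidden states and the teacher layers
    # they were matched to.
    hid = F.mse_loss(student_hidden, teacher_hidden)
    return alpha_ce * ce + alpha_hid * hid
```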


Thanks a lot! So you’re saying that we can still fine-tune distilbart-cnn-12-6 on cnn dm?

distilbart-cnn-12-6 is already fine-tuned on cnn dm. The way it is done is: first a student model is created as described above, and then it’s fine-tuned on cnn dm.

You can create your own (un-fine-tuned) student following the README and then fine-tune it on cnn dm. Un-fine-tuned students are already available here.


Thanks a lot!

I’ll add that although there’s nothing out yet, there will likely be a paper at some point.


For the alternating layers, did he choose, for example, odd layers / even layers? Or how does it work?

Pinging @sshleifer


thx for the ping
for 12-6 it’s [0, 2, 4, 7, 9, 11] bc I arbitrarily decided I wanted to keep the first and last layer in.
for 12-3 it’s [0, 6, 11], for 12-1 it’s [0]. 12-4 is [0, 4, 8, 11], which was working way better than 12-3 for mbart-large-enro in early experiments.
Note: A full copy of the teacher would be 12-12.

Code: https://github.com/huggingface/transformers/blob/09a2f40684f77e62d0fd8485fe9d2d610390453f/examples/seq2seq/distillation.py#L403

layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
    1: [0],
    2: [0, 6],
    3: [0, 6, 11],
    4: [0, 4, 8, 11],
    6: [0, 2, 4, 7, 9, 11],
    9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
    12: all_layers,
}

Also, what training algorithm was used to train/fine-tune? And what learning rate was it fine-tuned with on cnn dailymail?

Hi @Hildweig, it was fine-tuned the way you fine-tune/train any encoder-decoder Transformer model.
On a high level, the encoder takes the input sequence and creates a hidden representation of it.
The decoder then receives the encoder representation and is trained to generate the output sequence auto-regressively using teacher forcing.
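A minimal sketch of teacher forcing for a generic encoder-decoder model — the decoder is fed the gold target shifted right rather than its own predictions, and the loss is cross-entropy against the gold target at every position. The `model(src, decoder_input_ids)` interface and the use of the pad token as the shift-in token are assumptions for illustration, not the actual training code:

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, input_ids, labels, pad_token_id=1):
    """One training step's loss for a seq2seq model with teacher forcing."""
    # Build decoder inputs by shifting the gold labels right: the decoder
    # always conditions on the correct history, never on its own samples.
    decoder_input_ids = torch.full_like(labels, pad_token_id)
    decoder_input_ids[:, 1:] = labels[:, :-1]
    logits = model(input_ids, decoder_input_ids)  # (batch, tgt_len, vocab)
    # Cross-entropy against the gold tokens, ignoring padding positions.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_token_id,
    )
```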

You may have already read this, but The Illustrated Transformer by Jay Alammar explains it really well.

All the training hparams are in this file.