Distilbart paper

Good evening,
Is there a paper about DistilBART? I need documentation for my master thesis and I couldn't find any. Thanks for your help!


Hi @Hildweig, there is no paper for DistilBART; the idea of DistilBART came from @sshleifer's great mind.
You can find the details of the distillation process here

For the CNN models, the distilled model is created by copying alternating layers from `bart-large-cnn`. There is no teacher distillation, i.e. you just copy layers from the teacher model and then fine-tune the student model in the standard way.
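As a rough sketch of this copy-then-fine-tune idea (the "layers" below are plain dicts standing in for real transformer modules, and the index list is the one used for the 12-6 student):

```python
# Toy sketch of "no teacher distillation": the student is initialized by
# copying a subset of the teacher's layers, then fine-tuned normally.

def make_student(teacher_layers, indices):
    """Copy the teacher layers at the given indices into a new student stack."""
    return [dict(teacher_layers[i]) for i in indices]  # shallow copy per layer

teacher = [{"id": i, "weight": float(i)} for i in range(12)]  # 12-layer teacher
student = make_student(teacher, [0, 2, 4, 7, 9, 11])          # 12-6 student

print(len(student))       # 6
print(student[0]["id"])   # 0  (first teacher layer kept)
print(student[-1]["id"])  # 11 (last teacher layer kept)
```

After this initialization, the student would simply be fine-tuned on the summarization dataset as usual.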

And for XSUM it uses a combination of DistilBERT's cross-entropy loss on the output logits and the hidden-state MSE loss used in the TinyBERT paper.
DistilBERT paper
TinyBERT paper


Thanks a lot! So you're saying that we can still fine-tune distilbart-cnn-12-6 on CNN/DM?

distilbart-cnn-12-6 is already fine-tuned on CNN/DM. The way it is done: first a student model is created as described above, and then it is fine-tuned on CNN/DM.

You can create your own (not yet fine-tuned) student following the README and then fine-tune it on CNN/DM yourself. There are also already-available unfine-tuned students.


Thanks a lot!

I'll add that although there's nothing out yet, there will likely be a paper at some point.


For the alternating layers, did he choose, for example, odd layers/even layers? Or how does it work?

Pinging @sshleifer


thx for the ping
for 12-6 it's `[0, 2, 4, 7, 9, 11]` because I arbitrarily decided I wanted to keep the first and last layers in.
for 12-3 it's `[0, 6, 11]`, and for 12-1 it's `[0]`. 12-4 (`[0, 4, 8, 11]`) was working way better than 12-3 for mbart-large-enro in early experiments.
Note: A full copy of the teacher would be 12-12.

```python
layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
    1: [0],
    2: [0, 6],
    3: [0, 6, 11],
    4: [0, 4, 8, 11],
    6: [0, 2, 4, 7, 9, 11],
    9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
    12: all_layers,
}
```

Also, what training algorithm was used to train/fine-tune it? And what learning rate was used when it was fine-tuned on CNN/DailyMail?

Hi @Hildweig, it was fine-tuned the way you fine-tune/train any encoder-decoder Transformer model.
On a high level, the encoder takes the input sequence and creates a hidden representation of it.
The decoder then receives the encoder representation and is trained to generate the output sequence auto-regressively using teacher forcing.
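"Teacher forcing" just means the decoder is fed the gold target tokens shifted one position to the right, rather than its own previous predictions. A minimal sketch (the token IDs and start-token value are made up for illustration):

```python
def shift_tokens_right(labels, decoder_start_token_id):
    """Build decoder inputs for teacher forcing: prepend the start token
    and drop the last label, so the decoder at position t predicts labels[t]."""
    return [decoder_start_token_id] + labels[:-1]

labels = [42, 7, 99, 2]  # gold summary token IDs (say 2 is EOS)
decoder_inputs = shift_tokens_right(labels, decoder_start_token_id=0)
print(decoder_inputs)  # [0, 42, 7, 99]
```

At each step the model sees the correct previous tokens during training, which makes the loss computation a single parallel forward pass instead of a sequential generation loop.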

You may have already read this, but The Illustrated Transformer by Jay Alammar explains it really well.

All the training hyperparameters are in this file.

@sshleifer is there a reason that the R-L scores in the sheet (~30) differ a lot from the R-L scores reported by Facebook in the paper (~40)? (Specifically on the CNN/DM dataset.)

It comes down to the ROUGE package used for scoring: https://github.com/huggingface/transformers/issues/6808

If anyone is still reading this:

The paper was released on 28.09.2020:
Pre-trained Summarization Distillation


@sshleifer Hi, could you please share the hyperparameters you used for distilbart-xsum-12-3? I am trying to reproduce the result with the KD loss but am getting a lower result (21.18 ROUGE-2 vs. 21.63 reported in the paper). Thank you!

They should be somewhere in the `research_projects/` directory, but things have moved around.
If you show me a command, I can check for glaring errors.
Before you retrain, make sure your `max_length` and `length_penalty` (may have been renamed) match my xsum-12-3 config in the model hub.

This is the command that I am using:

```shell
train_distilbart_seq2seq --output_dir /tmp/path/to/dir/ \
    --dataset_name xsum \
    --learning_rate 3e-4 \
    --use_kd_loss True \
    --alpha_data 1.0 \
    --alpha_logits 0.8 \
    --alpha_hidden 3.0 \
    --max_source_length 1024 \
    --max_target_length 256 \
    --do_train \
    --do_eval \
    --do_predict \
    --num_beams 6 \
    --num_train_epochs 5 \
    --evaluation_strategy steps \
    --save_total_limit 5
```

The learning rate of 3e-4 is taken from the [example script](https://github.com/huggingface/transformers/tree/master/examples/research_projects/seq2seq-distillation).

I did not specify `min_length`, `max_length`, and `length_penalty`, as I let them take the values from the teacher model (`min_length=11, max_length=62`, which match the config in the model hub; I will need to double-check `length_penalty`). Other than that, please let me know if there's anything wrong with my command. Thank you!
Command looks fine, but you may want a different `length_penalty` than the teacher.