Good evening,
Is there a paper about distilbart? I need documentation for my master thesis and I couldn’t find any. Thanks for your help!
Hi @Hildweig, there is no paper for DistilBART; the idea of DistilBART came from @sshleifer’s great mind.
You can find the details of the distillation process here
For the CNN models, the distilled student is created by copying alternating layers from bart-large-cnn. There is no teacher distillation here, i.e. you just copy layers from the teacher model and then fine-tune the student in the standard way.
And for XSUM it uses a combination of DistilBERT’s ce_loss and the hidden-state MSE loss used in the TinyBERT paper.
DistilBERT paper
TinyBERT paper
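Roughly, that blended objective looks something like this (a minimal PyTorch sketch, not the actual training code; the temperature and weights here are just illustrative):

```python
import torch.nn.functional as F

def distil_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                temperature=2.0, alpha_ce=1.0, alpha_hidden=1.0):
    # DistilBERT-style soft-target loss: KL divergence between the softened
    # teacher and student token distributions
    ce_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # TinyBERT-style intermediate loss: MSE between matched hidden states
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)
    return alpha_ce * ce_loss + alpha_hidden * hidden_loss
```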
Thanks a lot! So you’re saying that we can still fine-tune distilbart-cnn-12-6 on CNN/DM?
distilbart-cnn-12-6 is already fine-tuned on CNN/DM. The way it is done: first a student model is created as described above, and then it is fine-tuned on CNN/DM.
You can create your own (un-fine-tuned) student following the README and then fine-tune it on CNN/DM. Here are the already available un-fine-tuned students
Thanks a lot!
I’ll add that although there’s nothing out yet, there will likely be a paper at some point.
For the alternating layers, did he choose, for example, odd layers / even layers? Or how does it work?
Pinging @sshleifer
thx for the ping
for 12-6 it’s [0, 2, 4, 7, 9, 11]
bc I arbitrarily decided I wanted to keep the first and last layer in.
for 12-3 it’s [0, 6, 11]
for 12-1 it’s [0]
12-4 is [0, 4, 8, 11], which is working way better than 12-3 for mbart-large-enro in early experiments.
Note: A full copy of the teacher would be 12-12.
all_layers = list(range(12))  # a full copy keeps every teacher layer
layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
    1: [0],
    2: [0, 6],
    3: [0, 6, 11],
    4: [0, 4, 8, 11],
    6: [0, 2, 4, 7, 9, 11],
    9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
    12: all_layers,
}
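To make that concrete, building e.g. the 12-6 CNN student boils down to something like this (just a sketch of the idea; the actual student-creation script in the examples handles the details properly):

```python
import torch.nn as nn
from transformers import BartForConditionalGeneration

layers_to_copy = [0, 2, 4, 7, 9, 11]  # the 12-6 student

student = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
# keep the full encoder, but shrink the decoder to the selected teacher layers
student.model.decoder.layers = nn.ModuleList(
    [student.model.decoder.layers[i] for i in layers_to_copy]
)
student.config.decoder_layers = len(layers_to_copy)
student.save_pretrained("my-distilbart-cnn-12-6-unfinetuned")
```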
Also, what training algorithm was used to train/fine-tune it? And what learning rate was used for fine-tuning on CNN/DailyMail?
Hi @Hildweig, it was fine-tuned the way you fine-tune/train any encoder-decoder Transformer model.
At a high level, the encoder takes the input sequence and creates a hidden representation of it. The decoder then receives the encoder representation and is trained to generate the output sequence auto-regressively using teacher forcing.
You may have already read this, but The Illustrated Transformer by Jay Alammar explains it really well.
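In transformers this amounts to passing labels to the model, e.g. (a minimal sketch; the texts and lengths are just for illustration):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")

article = "The local council approved a new bridge over the river on Tuesday ..."
summary = "Council approves new river bridge."

inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
labels = tokenizer(summary, max_length=128, truncation=True, return_tensors="pt").input_ids

# passing labels runs the decoder with teacher forcing (targets shifted right
# internally) and returns the token-level cross-entropy loss
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```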
All the training hparams are in this file.
@sshleifer is there a reason that the R-L scores in the sheet (~30) differ a lot from the R-L scores reported by Facebook in the paper (~40)? (Specifically on the CNN/DM dataset.)
It comes down to the rouge package: https://github.com/huggingface/transformers/issues/6808
If anyone is still reading this: the paper was released on 28.09.2020:
Pre-trained Summarization Distillation
@sshleifer Hi, could you please share the hyperparameters you used for distilbart-xsum-12-3? I am trying to reproduce the result with the KD loss but am getting a lower score (21.18 ROUGE-2 vs. 21.63 reported in the paper). Thank you!
They should be somewhere in the research_projects/ directory, but things have moved around. If you show me a command, I can check for glaring errors. Before you retrain, make sure your max_length and length_penalty (may have been renamed) match my xsum-12-3 config on the model hub.
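A quick way to check those values (assuming you pull my config straight from the hub):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sshleifer/distilbart-xsum-12-3")
print(config.max_length, config.min_length, config.length_penalty)
```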
This is the command that I am using (the learning rate of 3e-4 is taken from the [example script](https://github.com/huggingface/transformers/tree/master/examples/research_projects/seq2seq-distillation)):
train_distilbart_seq2seq --output_dir /tmp/path/to/dir/ \
--model_name facebook/bart-large-xsum \
--tokenizer_name facebook/bart-large-xsum \
--task summarization \
--dataset_name xsum \
--learning_rate 3e-4 \
--use_kd_loss True \
--alpha_data 1.0 \
--alpha_logits 0.8 \
--alpha_hidden 3.0 \
--max_source_length 1024 \
--max_target_length 256 \
--do_train \
--do_eval \
--do_predict \
--num_beams 6 \
--num_train_epochs 5 \
--evaluation_strategy steps \
--save_total_limit 5 \
--load_best_model_at_end \
--predict_with_generate
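For what it’s worth, my understanding is that those three weights blend the losses roughly like this (just a sketch of how I read the flags, not the actual script):

```python
def blended_loss(label_ce_loss, logits_kd_loss, hidden_mse_loss,
                 alpha_data=1.0, alpha_logits=0.8, alpha_hidden=3.0):
    # weights mirror --alpha_data / --alpha_logits / --alpha_hidden above
    return (alpha_data * label_ce_loss        # CE against the gold summaries
            + alpha_logits * logits_kd_loss   # KD term on teacher vs. student logits
            + alpha_hidden * hidden_mse_loss) # MSE on matched hidden states
```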
I did not specify min_length, max_length, and length_penalty, as I let them take the values from the teacher model (min_length=11, max_length=62, which match the config in the model hub; I will need to double-check length_penalty). Other than that, please let me know if there’s anything wrong with my command. Thank you!
Command looks fine, but you may want a different length_penalty than the teacher.