Seq2Seq Distillation: Methodology Questions

This thread should be used to ask questions about how examples/seq2seq/ works, and to ask questions about the associated paper after it gets released.


What is the reasoning behind choosing alternating layers?
Also, are there no-teacher distillation scores for XSUM?

No-teacher distillation works for non-seq2seq tasks as well, as we saw with MNLI. Should we also check whether it works on other tasks?

Copying alternating layers seems to perform best, by a moderate margin.
Definitely interested to see results for other tasks!
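For readers wondering what "alternating layers" means in practice: when initializing a shallower student from a deep teacher, you choose which teacher layer indices to copy. A minimal sketch of three common selection strategies is below; the function name and the exact index formula are illustrative assumptions, not the exact implementation in examples/seq2seq.

```python
def pick_layers(n_teacher: int, n_student: int, strategy: str = "alternate"):
    """Return the teacher layer indices to copy into an n_student-layer student.

    Strategies (illustrative, not the exact repo logic):
      - "first":     copy the first n_student layers
      - "last":      copy the last n_student layers
      - "alternate": spread picks evenly across the teacher stack,
                     so the student keeps layers from every depth
    """
    if n_student > n_teacher:
        raise ValueError("student cannot have more layers than the teacher")
    if strategy == "first":
        return list(range(n_student))
    if strategy == "last":
        return list(range(n_teacher - n_student, n_teacher))
    if strategy == "alternate":
        step = n_teacher / n_student
        return [round(i * step) for i in range(n_student)]
    raise ValueError(f"unknown strategy: {strategy}")


# e.g. initializing a 4-layer student from a 12-layer teacher:
print(pick_layers(12, 4))            # evenly spaced across the stack
print(pick_layers(12, 4, "first"))   # only the bottom of the teacher
```

The intuition for the evenly-spaced choice is that early teacher layers capture low-level features and late layers capture task-specific ones, so sampling across the whole depth preserves both.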


Has this been relocated to examples/research_projects/seq2seq-distillation/?

Yes, that project has now been moved to the research_projects directory.

Hey @sshleifer, I was trying to fine-tune the distill-pegasus-cnn-16-4 model you provided, but I am not sure of the hyperparameters. Could you please share the hyperparameters you used to train this model (and achieve the results shown in Table 5 of your paper)?

Thanks a lot!

Hi! I have a question regarding the article "Pre-trained Summarization Distillation". In Section 6.2, it says "Table 10 shows results from fine-tuning teacher models…". However, throughout the paper it is stated that the pseudo-labeling experiments were performed only when fine-tuning the student model. Is this a typo, and are the results shown actually from fine-tuning student models?

Thanks in advance!