Seq2Seq Distillation: Methodology Questions

This thread should be used to ask questions about how examples/seq2seq/ works, and to ask questions about the associated paper after it gets released.


What is the reasoning behind choosing alternating layers?
Also, are there no-teacher distillation scores for XSUM?

No-teacher distillation works for non-seq2seq tasks as well, as we saw with MNLI. Should we also check whether it works on other tasks?

Copying alternating layers seems to perform best, by a moderate margin.
Definitely interested to see results for other tasks!
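For readers wondering what "alternating layers" means in practice: when initializing a shallower student from a deep teacher, you choose which teacher layer indices to copy. A minimal sketch of three common selection strategies is below; the function name and the exact index formula are illustrative assumptions, not the exact implementation in examples/seq2seq.

```python
def pick_layers(n_teacher: int, n_student: int, strategy: str = "alternate"):
    """Return the teacher layer indices to copy into an n_student-layer student.

    Strategies (illustrative, not the exact repo logic):
      - "first":     copy the first n_student layers
      - "last":      copy the last n_student layers
      - "alternate": spread picks evenly across the teacher stack,
                     so the student keeps layers from every depth
    """
    if n_student > n_teacher:
        raise ValueError("student cannot have more layers than the teacher")
    if strategy == "first":
        return list(range(n_student))
    if strategy == "last":
        return list(range(n_teacher - n_student, n_teacher))
    if strategy == "alternate":
        step = n_teacher / n_student
        return [round(i * step) for i in range(n_student)]
    raise ValueError(f"unknown strategy: {strategy}")


# e.g. initializing a 4-layer student from a 12-layer teacher:
print(pick_layers(12, 4))            # evenly spaced across the stack
print(pick_layers(12, 4, "first"))   # only the bottom of the teacher
```

The intuition for the evenly-spaced choice is that early teacher layers capture low-level features and late layers capture task-specific ones, so sampling across the whole depth preserves both.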


Has this been relocated to examples/research_projects/seq2seq-distillation/?

Yes, that project has now been moved to the research_projects directory.

Hey @sshleifer, I was trying to fine-tune the distill-pegasus-cnn-16-4 model you provided, but I am not sure of the hyperparameters. Could you please share the hyperparameters you used to train this model (and achieve the results shown in Table 5 of your paper)?

Thanks a lot!

Hi! I have a question regarding the article "Pre-trained Summarization Distillation". In Section 6.2, it says "Table 10 shows results from fine-tuning teacher models…". However, throughout the paper it is stated that the pseudo-labeling experiments were performed only when fine-tuning the student model. Is this a typo, and are the results shown actually from fine-tuning student models?

Thanks in advance!