DistilBert weights initialization

I want to train a DistilBertModel from scratch on my own corpus, using BertModel as the teacher model. Following the DistilBERT paper, what's the best way to initialize the weights of my DistilBert with part of the teacher model's weights?

It seems the two models are built from different classes (e.g. `BertAttention` in `BertModel` vs. `MultiHeadSelfAttention` in `DistilBertModel`). Given that, I don't know whether I can simply "assign" the teacher's layers to the DistilBert's layers…

Hi @meisyarahd, you can find the distillation example here
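For reference, the DistilBERT paper initializes the student by taking one teacher layer out of two. Below is a minimal sketch of that idea as a state-dict name mapping: copy the embeddings and map student layer `i` to teacher layer `2*i`. The parameter names (`q_lin`, `sa_layer_norm`, `ffn.lin1`, etc. on the student side; `attention.self.query`, `intermediate.dense`, etc. on the teacher side) follow my reading of the `transformers` modeling code and may need adjusting for your version — treat them as assumptions, and check the official distillation example's extraction script for the exact mapping.

```python
# Sketch: initialize a DistilBert-style student from a Bert teacher by
# copying one teacher layer out of two (as in the DistilBERT paper).
# Parameter names are assumptions based on transformers' modeling code.

# Per-layer sub-module correspondence: student name -> teacher name.
LAYER_MAP = {
    "attention.q_lin": "attention.self.query",
    "attention.k_lin": "attention.self.key",
    "attention.v_lin": "attention.self.value",
    "attention.out_lin": "attention.output.dense",
    "sa_layer_norm": "attention.output.LayerNorm",
    "ffn.lin1": "intermediate.dense",
    "ffn.lin2": "output.dense",
    "output_layer_norm": "output.LayerNorm",
}

def build_name_mapping(n_student_layers):
    """Map each student parameter name to the teacher parameter it is
    initialized from. Student layer i copies teacher layer 2*i."""
    mapping = {
        # Embeddings are copied one-to-one.
        "embeddings.word_embeddings.weight": "embeddings.word_embeddings.weight",
        "embeddings.position_embeddings.weight": "embeddings.position_embeddings.weight",
        "embeddings.LayerNorm.weight": "embeddings.LayerNorm.weight",
        "embeddings.LayerNorm.bias": "embeddings.LayerNorm.bias",
    }
    for i in range(n_student_layers):
        t = 2 * i  # take one teacher layer out of two
        for s_mod, t_mod in LAYER_MAP.items():
            for p in ("weight", "bias"):
                mapping[f"transformer.layer.{i}.{s_mod}.{p}"] = (
                    f"encoder.layer.{t}.{t_mod}.{p}"
                )
    return mapping

def init_student_from_teacher(teacher_state, n_student_layers):
    """Build a student state dict by copying mapped tensors from the
    teacher's state dict (name -> tensor). Entries missing from the
    teacher are skipped so you can load the result with strict=False."""
    mapping = build_name_mapping(n_student_layers)
    return {s: teacher_state[t] for s, t in mapping.items() if t in teacher_state}
```

In practice you would pass in `bert_model.state_dict()` and load the result with `distilbert_model.load_state_dict(new_state, strict=False)`, letting the parts with no teacher counterpart keep their random initialization.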
