I want to train a `DistilBertModel` from scratch on my own corpus, using `BertModel` as the teacher model. Following the DistilBERT paper, what’s the best way to initialize the weights of my DistilBERT with part of the teacher model’s weights?
It seems the two models are built from different classes (e.g. `MultiHeadSelfAttention` in `DistilBertModel`). In this case, I don’t know whether I can just “assign” the teacher’s layers to the DistilBERT’s layers…
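
To make the question concrete, here is a minimal sketch of one possible approach: instead of assigning modules, rename the teacher's state-dict keys to the student's names and load the result. The key names below are my assumption about how `BertModel` and `DistilBertModel` name their parameters, so they should be checked against `model.state_dict().keys()` for the installed version. The helper `remap_bert_to_distilbert` and its default layer selection `(0, 2, 4, 7, 9, 11)` are hypothetical illustrations, not an official API; "one layer out of two" could equally be `(0, 2, 4, 6, 8, 10)`.

```python
# Sketch: rename BertModel state-dict keys to DistilBertModel names.
# All key names are assumptions; verify against state_dict().keys().

# (teacher sub-module, student sub-module) inside one transformer layer
LAYER_MAP = [
    ("attention.self.query", "attention.q_lin"),
    ("attention.self.key", "attention.k_lin"),
    ("attention.self.value", "attention.v_lin"),
    ("attention.output.dense", "attention.out_lin"),
    ("attention.output.LayerNorm", "sa_layer_norm"),
    ("intermediate.dense", "ffn.lin1"),
    ("output.dense", "ffn.lin2"),
    ("output.LayerNorm", "output_layer_norm"),
]

# Embedding parameters assumed to share the same name in both models.
# token_type_embeddings and the pooler are skipped on purpose, since the
# student is assumed to have no counterpart for them.
EMBEDDING_KEYS = [
    "embeddings.word_embeddings.weight",
    "embeddings.position_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
]


def remap_bert_to_distilbert(teacher_state, teacher_layers=(0, 2, 4, 7, 9, 11)):
    """Return a dict keyed by student names, holding the teacher's values.

    teacher_state  : mapping from teacher parameter name to tensor
    teacher_layers : teacher layer indices to keep, in student order
    """
    student_state = {}
    for key in EMBEDDING_KEYS:
        student_state[key] = teacher_state[key]
    for student_idx, teacher_idx in enumerate(teacher_layers):
        for src, dst in LAYER_MAP:
            for param in ("weight", "bias"):
                old = f"encoder.layer.{teacher_idx}.{src}.{param}"
                new = f"transformer.layer.{student_idx}.{dst}.{param}"
                student_state[new] = teacher_state[old]
    return student_state
```

With real models, something like `student.load_state_dict(remap_bert_to_distilbert(teacher.state_dict()), strict=False)` should then copy the values, provided the two configs agree on vocabulary size, hidden size, number of heads, and feed-forward size. The missing and unexpected keys that `load_state_dict` returns are worth inspecting to confirm that nothing was silently skipped.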