I want to train a `DistilBertModel` from scratch with my own corpus, using `BertModel` as the teacher model. Following the DistilBERT paper, what’s the best way to initialize the weights of my DistilBERT with part of the teacher model’s weights?
It seems the two models are constructed using different classes (e.g. `BertAttention` in `BertModel` and `MultiheadAttention` in `DistilBertModel`). In this case, I don’t know if I can just “assign” the teacher’s layers to the DistilBERT’s layers…
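
To make the question concrete, here is a rough sketch of the kind of "assignment" I have in mind: instead of copying modules, rename the teacher's `state_dict` entries to the student's parameter names. Everything in it is an assumption on my part: the parameter names are what I believe the two Hugging Face implementations use (they should be checked against `model.state_dict().keys()`), the helper names `student_to_teacher_keys` and `init_student_state` are made up, and the default choice of every other teacher layer is just my reading of "one layer out of two" in the paper.

```python
# Parameter-name prefixes inside one transformer block: (DistilBERT, BERT).
# ASSUMPTION: these names are not verified here; compare them with
# student.state_dict().keys() and teacher.state_dict().keys() before use.
BLOCK_MAP = [
    ("attention.q_lin", "attention.self.query"),
    ("attention.k_lin", "attention.self.key"),
    ("attention.v_lin", "attention.self.value"),
    ("attention.out_lin", "attention.output.dense"),
    ("sa_layer_norm", "attention.output.LayerNorm"),
    ("ffn.lin1", "intermediate.dense"),
    ("ffn.lin2", "output.dense"),
    ("output_layer_norm", "output.LayerNorm"),
]

# Embedding parameters that (I assume) share the same name in both models.
SHARED_EMBEDDING_KEYS = [
    "embeddings.word_embeddings.weight",
    "embeddings.position_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
]


def student_to_teacher_keys(teacher_layers=(0, 2, 4, 6, 8, 10)):
    """Map each student (DistilBERT) key to the teacher (BERT) key to copy from.

    teacher_layers is a hypothetical default: every other layer of a
    12-layer teacher, giving a 6-layer student.
    """
    key_map = {key: key for key in SHARED_EMBEDDING_KEYS}
    for student_idx, teacher_idx in enumerate(teacher_layers):
        for student_name, teacher_name in BLOCK_MAP:
            for param in ("weight", "bias"):
                student_key = f"transformer.layer.{student_idx}.{student_name}.{param}"
                teacher_key = f"encoder.layer.{teacher_idx}.{teacher_name}.{param}"
                key_map[student_key] = teacher_key
    return key_map


def init_student_state(teacher_state, teacher_layers=(0, 2, 4, 6, 8, 10)):
    """Build a student state dict by renaming selected teacher entries.

    Teacher entries with no student counterpart (token-type embeddings,
    pooler, skipped layers) are simply left out.
    """
    key_map = student_to_teacher_keys(teacher_layers)
    return {s_key: teacher_state[t_key] for s_key, t_key in key_map.items()}
```

If the names are right, I would expect to use it roughly as `result = student.load_state_dict(init_student_state(teacher.state_dict()), strict=False)` and then inspect `result.missing_keys` and `result.unexpected_keys`. This assumes the student and teacher share the hidden size, number of heads, vocabulary, and maximum position length, so the tensor shapes line up. Is this a sensible approach, or is there a recommended way?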