Hi! First post in the forums, excited to start digging into this great library!
I have a rookie, theoretical question. I have been reading the DistilBERT paper (fantastic!) and was wondering if it makes sense to pretrain a DistilBERT model from scratch.
In the paper, the authors specify that “The student is trained with a distillation loss over the soft target probabilities of the teacher.” My question is: when pretraining DistilBERT on a new corpus (say, another language), what are the ‘probabilities of the teacher’? AFAIK, the teacher does not have any interesting probabilities to show, since it has never seen the corpus either.
So my question is: how does the transformers library distill knowledge into the model when I train DistilBertForMaskedLM from scratch on a brand-new corpus? Sorry in advance if there is something really obvious I’m missing; I’m quite new to using transformers.
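Just to show my current understanding of what that distillation loss looks like, here is a rough sketch (purely illustrative, not the library’s actual implementation; the function name and the temperature value are my own assumptions):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Minimal sketch of a distillation loss over the teacher's soft targets.
    # The temperature value and the assumption that teacher and student
    # produce logits over the same vocabulary are placeholders.

    # Soft target probabilities of the teacher, softened by the temperature
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Student log-probabilities at the same temperature
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the two distributions, scaled by T^2 as is
    # conventional in knowledge distillation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```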
Just to be extra explicit, I would load my model like this:
```python
from transformers import DistilBertConfig, DistilBertForMaskedLM

config = DistilBertConfig(vocab_size=VOCAB_SIZE)
model = DistilBertForMaskedLM(config)
```
and train it like this:
```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```