Hi, I am interested in the BertForMaskedLM model. From the documentation, it seems that it can predict the likelihood of a masked token directly from the pretrained BERT model. Is that correct?

However, from looking at the code:

It seems that there is an extra LM head that applies a linear layer on top of the output hidden vectors, and the result is then dot-producted with the vocabulary embeddings to produce the likelihood. I am wondering where the weights for this head are loaded from, since I would expect from_pretrained to load only the weights for the BERT encoder. Or are they set to some default value each time? (I noticed that running the model gave the same value each time.) Also, does using BERT in this way to predict the likelihood of a masked token require further pre-training, or is that unnecessary given the pre-training that BERT already goes through?
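
To make the question concrete, here is a minimal sketch of how I understand the head, written in plain Python with random toy weights rather than the library's actual code. The names `lm_head`, `transform_w`, `vocab_embeddings`, and the toy sizes are my own placeholders, and I am assuming a GELU nonlinearity and leaving out any normalization step, so please treat it only as an illustration of the structure I am asking about.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU applied elementwise (assumed activation for this sketch)."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lm_head(hidden, transform_w, transform_b, vocab_embeddings, vocab_bias):
    """Hypothetical LM head: linear transform, then dot product with the vocabulary."""
    # Step 1: extra linear layer (plus nonlinearity) on top of the hidden vector.
    h = gelu(linear(hidden, transform_w, transform_b))
    # Step 2: dot product of the result with every vocabulary embedding row.
    logits = linear(h, vocab_embeddings, vocab_bias)
    # Step 3: normalize the scores into a likelihood over the vocabulary.
    return softmax(logits)


def rand_matrix(rng, rows, cols):
    """Toy random weights; a real model would load trained values instead."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration.
hidden_size, vocab_size = 4, 6
rng = random.Random(0)

hidden = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]  # encoder output at the masked position
transform_w = rand_matrix(rng, hidden_size, hidden_size)
transform_b = [0.0] * hidden_size
vocab_embeddings = rand_matrix(rng, vocab_size, hidden_size)
vocab_bias = [0.0] * vocab_size

probs = lm_head(hidden, transform_w, transform_b, vocab_embeddings, vocab_bias)
print(probs)
```

With a fixed seed this toy head returns the same distribution on every run, but that only reflects the fixed weights in the sketch. It does not tell me where the real head's weights come from, which is what I am trying to find out.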