Why aren't all weights of BertForPreTraining initialized from the model checkpoint?

When I load a BertForPreTraining model with pretrained weights using

model_pretrain = BertForPreTraining.from_pretrained('bert-base-uncased')

I get the following warning:

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']

Why aren’t all the weights in cls.predictions initialized from the saved checkpoint?
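For reference, here is how to get the full loading report rather than just the warning (a minimal sketch; I believe from_pretrained accepts output_loading_info=True and then returns the missing/unexpected keys alongside the model):

from transformers import BertForPreTraining

# Ask from_pretrained to also return a report of how the checkpoint was loaded.
model_pretrain, loading_info = BertForPreTraining.from_pretrained(
    'bert-base-uncased', output_loading_info=True
)

# Keys the model expects but the checkpoint does not provide (these get newly
# initialized), and checkpoint keys the model does not use.
print("missing keys:   ", loading_info["missing_keys"])
print("unexpected keys:", loading_info["unexpected_keys"])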

The model nevertheless seems to produce reliable token predictions without any further training. In particular, it produces the same outputs as a model loaded with

model_masked = BertForMaskedLM.from_pretrained('bert-base-uncased')

Here’s code verifying this on an example:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

s = ("Pop superstar Shakira says she was the [MASK] of a random [MASK] by a [MASK] "
     "of [MASK] boars while walking in a [MASK] in Barcelona with her eight-year-old "
     "[MASK].")
inputs = tokenizer(s, return_tensors='pt')

outputs_pretrain = model_pretrain(**inputs)
outputs_masked = model_masked(**inputs)

# The masked-LM logits of the two models should match exactly.
assert torch.allclose(outputs_pretrain["prediction_logits"], outputs_masked["logits"])
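
To make the claim about reliable predictions concrete, decoding the most likely token at each [MASK] position is enough (a small sketch on top of the snippet above):

# Decode the most likely token at each [MASK] position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id)[0]
top_ids = outputs_pretrain["prediction_logits"][0].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(top_ids[mask_positions].tolist()))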

Incidentally, when loading model_masked, I don’t get a warning about newly initialized weights in cls.predictions. The warning instead concerns the cls.seq_relationship weights in the checkpoint, which go unused; that is reasonable, since if we only care about masked LM, the next sentence prediction head from the pretrained checkpoint can safely be thrown away.
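
As an extra sanity check (a sketch, assuming both models from the snippets above are still in memory), the cls.predictions parameters of the two models can be compared tensor by tensor:

# Compare the masked-LM head parameters of the two models; weights loaded from
# the same checkpoint should be bitwise identical.
params_pretrain = dict(model_pretrain.cls.predictions.named_parameters())
params_masked = dict(model_masked.cls.predictions.named_parameters())
for name, p in params_pretrain.items():
    print(name, torch.equal(p, params_masked[name]))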

The BertForPreTraining model is BERT with two heads on top (the ones used to pre-train BERT, namely next sentence prediction and masked language modeling). The bert-base-uncased checkpoint on the Hub only includes the language modeling head (it’s really meant to be loaded into a BertForMaskedLM model). You can also see this in the config file here.
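
You can see the two heads directly by printing the head module of the model loaded in the question (a small sketch, reusing the model_pretrain variable from above):

# BertForPreTraining's head module contains both the MLM head ("predictions")
# and the NSP head ("seq_relationship").
print(model_pretrain.cls)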

Has the NSP head ever been included? I.e. are those weights available anywhere?

But aren’t the weights for the LM head stored under cls.predictions? If so, shouldn’t cls.predictions.decoder.bias be initialized from the checkpoint? The weights under cls.seq_relationship should be the randomly initialized ones, no?
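
For reference, one way to settle this would be to load the raw checkpoint’s state dict from the Hub and list which cls.* keys it actually contains (a sketch; it assumes the repo still ships a pytorch_model.bin, and the exact filename may differ):

import torch
from huggingface_hub import hf_hub_download

# Download the raw checkpoint file and list the classification-head keys it contains.
ckpt_path = hf_hub_download(repo_id="bert-base-uncased", filename="pytorch_model.bin")
state_dict = torch.load(ckpt_path, map_location="cpu")
print([k for k in state_dict if k.startswith("cls.")])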