Why aren't all weights of BertForPreTraining initialized from the model checkpoint?

When I load a BertForPreTraining model with pretrained weights using

model_pretrain = BertForPreTraining.from_pretrained('bert-base-uncased')

I get the following warning:

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']

Why aren’t all the weights in cls.predictions initialized from the saved checkpoint?
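For reference, here is how to get the full loading report rather than just the warning (a minimal sketch; I believe from_pretrained accepts output_loading_info=True and then returns the missing/unexpected keys alongside the model):

from transformers import BertForPreTraining

# Ask from_pretrained to also return a report of how the checkpoint was loaded.
model_pretrain, loading_info = BertForPreTraining.from_pretrained(
    'bert-base-uncased', output_loading_info=True
)

# Keys the model expects but the checkpoint does not provide (these get newly
# initialized), and checkpoint keys the model does not use.
print("missing keys:   ", loading_info["missing_keys"])
print("unexpected keys:", loading_info["unexpected_keys"])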

The model nevertheless seems to produce reliable token predictions without any further training. In particular, it produces the same outputs as a model loaded with

model_masked = BertForMaskedLM.from_pretrained('bert-base-uncased')

Here’s code verifying this on an example:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

s = ("Pop superstar Shakira says she was the [MASK] of a random [MASK] by a [MASK] "
     "of [MASK] boars while walking in a [MASK] in Barcelona with her eight-year-old "
     "[MASK].")
inputs = tokenizer(s, return_tensors='pt')

outputs_pretrain = model_pretrain(**inputs)
outputs_masked = model_masked(**inputs)

# The masked-LM logits of the two models should match exactly.
assert torch.allclose(outputs_pretrain["prediction_logits"], outputs_masked["logits"])
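
To make the claim about reliable predictions concrete, decoding the most likely token at each [MASK] position is enough (a small sketch on top of the snippet above):

# Decode the most likely token at each [MASK] position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id)[0]
top_ids = outputs_pretrain["prediction_logits"][0].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(top_ids[mask_positions].tolist()))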

Incidentally, when loading model_masked, I don’t get a warning about newly initialized weights in cls.predictions. The warning instead concerns the cls.seq_relationship weights in the checkpoint, which go unused; that is reasonable, since if we only care about masked LM, the next sentence prediction head from the pretrained checkpoint can safely be thrown away.
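
As an extra sanity check (a sketch, assuming both models from the snippets above are still in memory), the cls.predictions parameters of the two models can be compared tensor by tensor:

# Compare the masked-LM head parameters of the two models; weights loaded from
# the same checkpoint should be bitwise identical.
params_pretrain = dict(model_pretrain.cls.predictions.named_parameters())
params_masked = dict(model_masked.cls.predictions.named_parameters())
for name, p in params_pretrain.items():
    print(name, torch.equal(p, params_masked[name]))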

The BertForPreTraining model is BERT with two heads on top (the ones used to pre-train BERT, namely next sentence prediction and masked language modeling). The bert-base-uncased checkpoint on the Hub only includes the language modeling head (it’s really meant to be loaded into a BertForMaskedLM model). You can also see this in the config file here.
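
You can see the two heads directly by printing the head module of the model loaded in the question (a small sketch, reusing the model_pretrain variable from above):

# BertForPreTraining's head module contains both the MLM head ("predictions")
# and the NSP head ("seq_relationship").
print(model_pretrain.cls)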

Has the NSP head ever been included? I.e. are those weights available anywhere?

But aren’t the weights for the LM head stored under cls.predictions? If so, shouldn’t cls.predictions.decoder.bias be initialized from the checkpoint? The weights under cls.seq_relationship should be the randomly initialized ones, no?
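
For reference, one way to settle this would be to load the raw checkpoint’s state dict from the Hub and list which cls.* keys it actually contains (a sketch; it assumes the repo still ships a pytorch_model.bin, and the exact filename may differ):

import torch
from huggingface_hub import hf_hub_download

# Download the raw checkpoint file and list the classification-head keys it contains.
ckpt_path = hf_hub_download(repo_id="bert-base-uncased", filename="pytorch_model.bin")
state_dict = torch.load(ckpt_path, map_location="cpu")
print([k for k in state_dict if k.startswith("cls.")])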