Some weights of BertModel were not initialized from the model checkpoint

I was able to train at the word level. After that, I tested with the fill-mask pipeline and got the warning below:

Some weights of BertModel were not initialized from the model checkpoint at ./output_model and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import BertConfig, BertForMaskedLM, PreTrainedTokenizerFast, pipeline

tokenizer = PreTrainedTokenizerFast(tokenizer_file="./my-tokenizer.json")
model = BertForMaskedLM(config=BertConfig(vocab_size=1000000))

# <after training>

fill_mask = pipeline(
    "fill-mask",
    model="./output_model",
    tokenizer=tokenizer
)  # emits the warning above


Hi,

I also get the same warning when using AutoModelForMaskedLM in a fill-mask pipeline, even though I fine-tuned the model with AutoModelForMaskedLM…

Having randomly initialized layers can't be good when using the model. Is there a way to solve this?

This is interesting, thanks for reporting.

I’m opening an issue on GitHub, as I’m running into a similar problem.

The “solution” proposed in the issue you opened (i.e. passing add_pooling_layer=False) doesn’t work with the pipeline object. Is anyone still in the same situation? Did you solve it?

Hi,

Replying here to my past self: this happens because the pooler is not part of the masked language model. It’s not a problem; the warning just tells you that a BertModel gets instantiated with a pooler head alongside the masked language modeling head, and that pooler is randomly initialized. So if you’re doing masked language modeling, you’re fine.
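For example (a minimal sketch, assuming "./output_model" was saved from the fine-tuned BertForMaskedLM above), the warning only shows up when a pooler is added on top of the saved weights, and you can skip it explicitly:

from transformers import BertForMaskedLM, BertModel

# Loading back into the masked-LM class matches the saved weights: no pooler warning.
mlm_model = BertForMaskedLM.from_pretrained("./output_model")

# Loading the base BertModel adds a pooler that was never saved, so
# bert.pooler.dense.weight / bert.pooler.dense.bias get reported as newly initialized.
base_model = BertModel.from_pretrained("./output_model")

# If you only need per-token hidden states, drop the pooler entirely:
encoder = BertModel.from_pretrained("./output_model", add_pooling_layer=False)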

But I just want the barebone model to get the per-token representations.

Do I get what I want with this flag add_pooling_layer=False?

What is a pooler head?

Hi,

A pooling layer is typically a linear projection (nn.Linear) or an MLP which “pools” (i.e. combines) all the token embeddings into a single embedding vector.

The simplest form of pooling is just averaging all token embeddings along the sequence dimension, i.e. pooler_output = torch.mean(last_hidden_state, dim=1). Alternatively, and that’s what BERT does, you first take the final hidden state (embedding) of the special CLS token and then apply the pooling layer to it. This means:

last_hidden_state = outputs.last_hidden_state
# get the embedding of the special CLS token (first position in the sequence)
cls_hidden_state = last_hidden_state[:, 0, :]
# apply the pooler to it
pooler_output = self.pooler(cls_hidden_state)
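
As a self-contained illustration (a sketch with dummy tensors, not BERT’s actual source code; in transformers the BertPooler is an nn.Linear followed by a Tanh, which is exactly the bert.pooler.dense layer named in the warning above):

import torch
import torch.nn as nn

batch_size, seq_len, hidden_size = 2, 8, 768

# dummy "last_hidden_state", as it would come out of BertModel
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Option 1: mean pooling over the sequence dimension
mean_pooled = last_hidden_state.mean(dim=1)      # shape: (batch_size, hidden_size)

# Option 2: BERT-style pooling: take the CLS token (position 0),
# then apply a linear layer + tanh activation
pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
cls_pooled = pooler(last_hidden_state[:, 0, :])  # shape: (batch_size, hidden_size)

print(mean_pooled.shape, cls_pooled.shape)       # both torch.Size([2, 768])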