Some weights of BertModel were not initialized from the model checkpoint

I was able to train at the word level. After that, I tested with the fill-mask pipeline and got the warning below:

Some weights of BertModel were not initialized from the model checkpoint at ./output_model and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import BertConfig, BertForMaskedLM, PreTrainedTokenizerFast, pipeline

tokenizer = PreTrainedTokenizerFast(tokenizer_file="./my-tokenizer.json")
model = BertForMaskedLM(config=BertConfig(vocab_size=1000000))

# <after training>

fill_mask = pipeline(
    "fill-mask",
    model="./output_model",
    tokenizer=tokenizer
)  # emits the warning above


Hi,

I also get the same warning when using AutoModelForMaskedLM in a fill-mask pipeline, even though I fine-tuned the model with AutoModelForMaskedLM…

Having randomly initialized layers can't be good when using the model. Is there a way to solve this?

This is interesting, thanks for reporting.

I’m opening an issue on GitHub, as I’m running into a similar problem.

The “solution” proposed in the issue you opened (i.e. passing add_pooling_layer=False) doesn’t work with the pipeline object. Is anyone still in the same situation? Did you solve it?

Hi,

Replying here to my past self: this happens because the pooler is not part of the masked language model. It’s not a problem; the warning just tells you that a BertModel gets instantiated with a pooler head alongside the masked language modeling head, and that pooler is randomly initialized. So if you’re doing masked language modeling, you’re fine.
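For example (a minimal sketch, assuming "./output_model" was saved from the fine-tuned BertForMaskedLM above), the warning only shows up when a pooler is added on top of the saved weights, and you can skip it explicitly:

from transformers import BertForMaskedLM, BertModel

# Loading back into the masked-LM class matches the saved weights: no pooler warning.
mlm_model = BertForMaskedLM.from_pretrained("./output_model")

# Loading the base BertModel adds a pooler that was never saved, so
# bert.pooler.dense.weight / bert.pooler.dense.bias get reported as newly initialized.
base_model = BertModel.from_pretrained("./output_model")

# If you only need per-token hidden states, drop the pooler entirely:
encoder = BertModel.from_pretrained("./output_model", add_pooling_layer=False)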

But I just want the barebone model to get the per-token representations.

Do I get what I want with this flag add_pooling_layer=False?

What is a pooler head?

Hi,

A pooling layer is typically a linear projection (nn.Linear) or an MLP which “pools” (i.e. combines) all the token embeddings into a single embedding vector.

The simplest form of pooling is just averaging all token embeddings along the sequence dimension, i.e. pooler_output = torch.mean(last_hidden_state, dim=1). Alternatively, and that’s what BERT does, you first take the final hidden state (embedding) of the special CLS token and then apply the pooling layer to it. This means:

last_hidden_state = outputs.last_hidden_state
# get the embedding of the special CLS token (first position in the sequence)
cls_hidden_state = last_hidden_state[:, 0, :]
# apply the pooler to it
pooler_output = self.pooler(cls_hidden_state)
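
As a self-contained illustration (a sketch with dummy tensors, not BERT’s actual source code; in transformers the BertPooler is an nn.Linear followed by a Tanh, which is exactly the bert.pooler.dense layer named in the warning above):

import torch
import torch.nn as nn

batch_size, seq_len, hidden_size = 2, 8, 768

# dummy "last_hidden_state", as it would come out of BertModel
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Option 1: mean pooling over the sequence dimension
mean_pooled = last_hidden_state.mean(dim=1)      # shape: (batch_size, hidden_size)

# Option 2: BERT-style pooling: take the CLS token (position 0),
# then apply a linear layer + tanh activation
pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
cls_pooled = pooler(last_hidden_state[:, 0, :])  # shape: (batch_size, hidden_size)

print(mean_pooled.shape, cls_pooled.shape)       # both torch.Size([2, 768])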