Is the "Some weights of the model were not used" warning normal when BERT was pre-trained with MLM only?

Hello guys,

I’ve trained a BERT model from scratch using BertForMaskedLM and the Trainer. When I use AutoModelForSequenceClassification to fine-tune my model for a text classification task, I get a warning about weight initialization. Is it normal to get a warning like the one below, or am I doing something wrong?

Some weights of the model checkpoint at ./cased/bert-wikidump-50mb-mlm/model were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./cased/bert-wikidump-50mb-mlm/model and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'classifier.weight', 'classifier.bias']
Loading pre-trained model with AutoModelForSequenceClassification
from transformers import AutoModelForSequenceClassification, AdamW, AutoConfig
config = AutoConfig.from_pretrained(PATHS["model"]["cased"]["local"], num_labels=df.category.unique().size)

model = AutoModelForSequenceClassification.from_pretrained(PATHS["model"]["cased"]["local"], config=config)
Code for training BERT from scratch with only MLM task
from transformers import BertConfig
config = BertConfig(vocab_size=64_000)

from transformers import BertForMaskedLM
model = BertForMaskedLM(config=config)

from transformers import Trainer, TrainingArguments

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_cased_tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    per_gpu_train_batch_size=8,  # 512 max sequence length, 64 sequence count
    # ...
)

trainer = Trainer(
    # ...
)

Yes, the warning is telling you that some weights were randomly initialized (here, your classification head), which is normal since you are instantiating a pretrained model for a different task. It’s there to remind you to fine-tune your model (it’s not usable for inference directly).
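To see why exactly those names show up in the warning, it can help to think of loading as a comparison between the parameter names stored in the checkpoint and the names the new model class expects. The snippet below is only a simplified illustration of that idea, not the library's actual loading code; the key lists are abbreviated, and `compare_keys` is a hypothetical helper written for this example.

```python
# Simplified illustration: checkpoint loading viewed as a set comparison
# of parameter names. This is NOT the library's real loading code.

# Abbreviated parameter names saved by a BertForMaskedLM checkpoint.
checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "cls.predictions.bias",
    "cls.predictions.transform.dense.weight",
    "cls.predictions.decoder.weight",
}

# Abbreviated parameter names expected by BertForSequenceClassification.
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "bert.pooler.dense.bias",
    "classifier.weight",
    "classifier.bias",
}


def compare_keys(checkpoint_keys, model_keys):
    """Hypothetical helper: split names into unused, newly initialized, loaded."""
    unused = sorted(checkpoint_keys - model_keys)             # in checkpoint only
    newly_initialized = sorted(model_keys - checkpoint_keys)  # in model only
    loaded = sorted(checkpoint_keys & model_keys)             # shared encoder weights
    return unused, newly_initialized, loaded


unused, newly_initialized, loaded = compare_keys(checkpoint_keys, model_keys)
print("not used from checkpoint:", unused)
print("newly initialized:", newly_initialized)
print("loaded from checkpoint:", loaded)
```

The MLM head (`cls.predictions.*`) has no counterpart in the classification model, so it is reported as not used. The pooler and classifier have no counterpart in the checkpoint, so they are reported as newly initialized. The shared `bert.*` encoder weights are what actually carry over from pre-training.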


Thanks @sgugger ! :slightly_smiling_face:

Hi @sgugger, may I know how to suppress this message? I load the pretrained model in multiple processes, so the repeated notification is visually overwhelming. Thank you!
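One possible way to quiet these messages is to raise the log level of the library's logger so that warnings are no longer emitted. The sketch below uses only Python's standard `logging` module and assumes the messages are emitted through loggers under the `transformers` namespace; `quiet_transformers_logging` is a hypothetical helper name. The library also provides `transformers.logging.set_verbosity_error()`, which should have a similar effect.

```python
import logging


def quiet_transformers_logging():
    """Hypothetical helper: only let ERROR and above through.

    Assumes the weight-initialization messages are logged under the
    "transformers" logger namespace at WARNING level or lower.
    """
    logging.getLogger("transformers").setLevel(logging.ERROR)


# Call this before loading the model in each process.
quiet_transformers_logging()
```

With multiprocessing, each worker may need to apply the setting itself, for example by passing the helper as `initializer=quiet_transformers_logging` to `multiprocessing.Pool`. Keep in mind that this also hides other warnings from the library, so it may be worth enabling it only once the setup is known to be correct.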