Is "Some weights of the model were not used" warning normal when pre-trained BERT only by MLM

Hello guys,

I’ve trained a BERT model from scratch using BertForMaskedLM and the Trainer API. When I use AutoModelForSequenceClassification to fine-tune my model for a text classification task, I get a warning about weight initialization. Is it normal to get a warning like the one below, or am I doing something wrong?

Some weights of the model checkpoint at ./cased/bert-wikidump-50mb-mlm/model were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./cased/bert-wikidump-50mb-mlm/model and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'classifier.weight', 'classifier.bias']
Loading pre-trained model with AutoModelForSequenceClassification
from transformers import AutoModelForSequenceClassification, AdamW, AutoConfig
config = AutoConfig.from_pretrained(PATHS["model"]["cased"]["local"], num_labels=df.category.unique().size)

model = AutoModelForSequenceClassification.from_pretrained(PATHS["model"]["cased"]["local"], config=config)
Code for training BERT from scratch with only the MLM task
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

config = BertConfig(vocab_size=64_000)
model = BertForMaskedLM(config=config)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_cased_tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=PATHS["model"]["cased"]["training"]["local"],
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=8,  # 512 max sequence length, 64 sequence count (per_gpu_train_batch_size is deprecated)
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
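
For reference, bert_cased_tokenizer and dataset are not defined in the snippet above. A minimal sketch of how they could be built, assuming a locally saved WordPiece tokenizer and a plain-text corpus at hypothetical paths:

from transformers import BertTokenizerFast, LineByLineTextDataset

# Hypothetical paths -- substitute your own tokenizer directory and corpus file.
bert_cased_tokenizer = BertTokenizerFast.from_pretrained("./cased/tokenizer")

# LineByLineTextDataset treats each line of the text file as one training example.
dataset = LineByLineTextDataset(
    tokenizer=bert_cased_tokenizer,
    file_path="./data/wikidump-50mb.txt",
    block_size=512,
)

Training then starts with trainer.train(), and the checkpoint used above would be written with trainer.save_model(...).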

Yes, the warning is telling you that some weights were randomly initialized (here, your classification head), which is normal since you are instantiating a pretrained model for a different task. It’s there to remind you to fine-tune your model (it’s not usable for inference directly).
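
A minimal fine-tuning sketch with the same Trainer API (train_dataset here stands in for a tokenized classification dataset that is not shown in this thread, and the output path is hypothetical):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./cased/bert-wikidump-50mb-classifier",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,                  # the AutoModelForSequenceClassification loaded above
    args=training_args,
    train_dataset=train_dataset,  # tokenized classification dataset (not shown here)
)
trainer.train()  # this is where the randomly initialized classifier head gets trained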


Thanks @sgugger! 🙂

Hi @sgugger, may I know how to suppress this message? I load the pretrained model across multiple processes, so the notification is visually overwhelming. Thank you!


Is there a way to avoid having some weights randomly initialized when they have already been initialized?

To suppress this output, the Stack Overflow question "Python: BERT Error - Some weights of the model checkpoint at were not used when initializing BertModel" suggests changing the verbosity level of transformers.logging.
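
Concretely, that amounts to something like the following; note that it hides all transformers warnings and info messages, not just this one:

from transformers import logging

logging.set_verbosity_error()  # only errors are printed; the weight-initialization warning is suppressed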

Is there a way to use distilbert-base-uncased and other models out of the box, without fine-tuning, for benchmarking?