How do I change the classification head of a model?

I would like to change the number of labels of a trained model. I am loading a model that was trained on 17 classes and would like to adapt it to my own task. If I simply change the number of labels like this:

from transformers import AutoModelForTokenClassification

model_checkpoint = "vblagoje/bert-english-uncased-finetuned-pos"
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=2)

I get an error saying:

RuntimeError: Error(s) in loading state_dict for BertForTokenClassification:
        size mismatch for classifier.weight: copying a param with shape torch.Size([17, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
        size mismatch for classifier.bias: copying a param with shape torch.Size([17]) from checkpoint, the shape in current model is torch.Size([2]).

My question is: How do I replace the classification head?

Thanks a lot :hugs:


The reason is that you are trying to use a model that has already been fine-tuned on a particular classification task. You have to remove the last part (the classification head) of the model.

This is actually a kind of design fault too. In practice,

( BERT base uncased + Classification ) = new Model

is your model. Now, if you want to reuse it on a different task, either use BERT base uncased or extract that part from the new model.

I wanted to test whether training on the POS task first results in better scores than using just the plain BERT base.

So my question is: How do I extract the “base” part from a trained model and add a new head?

I am not sure :slight_smile:

Hi, from what I noticed, the weights you're using are already fine-tuned for token classification (the classifier has been trained for that task). I recommend you fine-tune from bert-base-uncased instead, like this:

from transformers import AutoModelForTokenClassification
# Load the base checkpoint; the 2-label classification head is newly initialised
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Start your own training

Or, if you want to write your own model with a custom classifier head, as requested:

import torch.nn as nn
from transformers import AutoModel

class PosModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base_model = AutoModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.5)
        self.linear = nn.Linear(768, 2)  # BERT base outputs 768 features; 2 is your number of labels

    def forward(self, input_ids, attn_mask):
        outputs = self.base_model(input_ids, attention_mask=attn_mask)
        # Write your new head here
        outputs = self.dropout(outputs[0])  # outputs[0] is the last hidden state
        outputs = self.linear(outputs)
        return outputs

model = PosModel().to('cuda')

I am assuming you're using PyTorch for this.

This is a good solution if you train a new model from a plain LM like bert-base-uncased. What I am trying to do is replace the classification head of an already fine-tuned model. Running your first snippet with a model that was pre-trained for token classification results in an error message (see my sample above).
Running your second snippet will add a new classification layer on top of the existing one when you run it with a POS model like “vblagoje/bert-english-uncased-finetuned-pos”.

Your error is basically a mismatch in the final layer, which is the classifier part:

vblagoje/bert-english-uncased-finetuned-pos - these weights are already fine-tuned for 17 classes

bert-base-uncased - this one isn't fine-tuned for anything yet, so it is what you want to use for fine-tuning on the 2 classes you have

Unfortunately the author of the model did not specify what it was fine-tuned on, so I have no idea myself, but this model has already been fine-tuned, therefore you cannot continue fine-tuning it.


Technically it should not be an issue to remove a classification head. This is the main idea of transfer learning, and people do this all the time with CNNs. I would like to check whether this works for NLP as well.

I mean, in this case, when you load the weights the checkpoint expects the linear layer to have 17 outputs but you specified 2. This is where the error comes from:

size mismatch for classifier.weight: copying a param with shape torch.Size([17, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).

Well, technically you can do that by replacing the final layer after loading the model and then fine-tuning it again.

# did not test this out
import torch.nn as nn
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=17)
model.classifier = nn.Linear(768, 2)  # new, randomly initialised 2-label head
# Run training

Well, I tried that and added two lines to set the correct number of labels:

model.classifier = nn.Linear(786,1)
model.num_labels = 2
model.config.num_labels = 2

Printing the model shows that this worked.

  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=786, out_features=1, bias=True)

However, running this with the trainer class results in this error:

mat1 dim 1 must match mat2 dim 0

I suspect that I missed something or broke some clever autoconfiguration. That's why I wrote my question here.

model.classifier = nn.Linear(786,1) # you have 1 class? I think you should change it to 2
model.num_labels = 2 # while here you specify 2 classes, so it's a bit confusing

Unless you are aiming for a sigmoid function on your last layer and that is why you are using a single output. In that case I think you need to change your loss function to BCEWithLogitsLoss.
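
One more thing worth checking: the replacement layer above is built with 786 input features, while the checkpoint's classifier weight has shape [17, 768]. A mismatch like that would be consistent with the "mat1 dim 1 must match mat2 dim 0" error. Below is a minimal sketch of swapping the head while reading the input size from the config instead of typing it by hand. To keep it runnable offline it builds a tiny, randomly initialised BERT from a made-up BertConfig (all sizes are assumptions for illustration only); with a real checkpoint you would load the model with from_pretrained instead.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertForTokenClassification

# Tiny stand-in for a 17-label checkpoint (hypothetical sizes, random weights).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=17,
)
model = BertForTokenClassification(config)

# Replace the head: in_features must equal the encoder's hidden size.
new_num_labels = 2
model.classifier = nn.Linear(model.config.hidden_size, new_num_labels)
# Keep the label count in sync, since the loss computation relies on it.
model.num_labels = new_num_labels
model.config.num_labels = new_num_labels

# Dummy batch: 1 sequence of 5 token ids, with one label per token.
input_ids = torch.randint(0, config.vocab_size, (1, 5))
labels = torch.randint(0, new_num_labels, (1, 5))
with torch.no_grad():
    out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)
```

The new head is randomly initialised, so the model still needs training before its predictions mean anything.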


Check this. It's neatly implemented here.
And a pretty neat architecture as well.

model —> downstream_task . Well defined.

Thanks for your help! It was way too late when I wrote this :see_no_evil:

It turned out that the easiest way to solve my initial question is this:

from transformers import AutoConfig, AutoModelForTokenClassification

config = AutoConfig.from_pretrained(model_checkpoint)
config.num_labels = 2
model = AutoModelForTokenClassification.from_config(config)

It works, but how does this change affect the model architecture and the results? It would be great if anyone could explain the intuition behind this.

I don’t think this solved your problem. Initialising a model with ‘from_config’ only applies the model configuration; it does not load the model weights.
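
To make this concrete, here is a small sketch that can be run offline. It saves a tiny, randomly initialised model to a temporary directory as a stand-in for a fine-tuned checkpoint (the BertConfig sizes are made up for illustration), rebuilds a model with from_config, and compares one encoder weight against the saved one.

```python
import tempfile

import torch
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    BertConfig,
    BertForTokenClassification,
)

torch.manual_seed(0)

# Hypothetical tiny model standing in for a fine-tuned 17-label checkpoint.
tiny = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=17,
)
trained = BertForTokenClassification(tiny)

with tempfile.TemporaryDirectory() as checkpoint_dir:
    trained.save_pretrained(checkpoint_dir)
    # Same recipe as above: only the configuration is read from the checkpoint.
    config = AutoConfig.from_pretrained(checkpoint_dir)
    config.num_labels = 2
    from_config_model = AutoModelForTokenClassification.from_config(config)

saved = trained.bert.embeddings.word_embeddings.weight
rebuilt = from_config_model.bert.embeddings.word_embeddings.weight
weights_were_loaded = torch.equal(saved, rebuilt)
print(weights_were_loaded)
```

The comparison comes out False: the rebuilt model has the right architecture and a 2-label head, but every weight, encoder included, is freshly initialised.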

Does anyone know how to solve this problem properly?

The head-replacement approaches suggested above did not work in my case. Instead, what worked for me is to replace the body rather than the head, as follows:

from transformers import BertForSequenceClassification

old_model = BertForSequenceClassification.from_pretrained("model-x")
new_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=HowMany_LABELS_I_WANT)
new_model.bert = old_model.bert
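
One caveat with assigning the body directly: both models then hold the very same module object, so training new_model also changes old_model's encoder. If that matters, a possible variant is to copy the encoder weights instead. The sketch below uses two tiny, randomly initialised models built from a made-up config (a hypothetical tiny_config helper, sizes chosen only so it runs offline) in place of "model-x" and "bert-base-uncased".

```python
import torch
from transformers import BertConfig, BertForSequenceClassification


def tiny_config(num_labels):
    # Hypothetical small sizes for illustration; real checkpoints are much larger.
    return BertConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=1,
        num_attention_heads=2,
        intermediate_size=64,
        num_labels=num_labels,
    )


old_model = BertForSequenceClassification(tiny_config(17))  # stand-in for "model-x"
new_model = BertForSequenceClassification(tiny_config(3))   # fresh 3-label model

# Copy the encoder weights instead of sharing the module between the two models.
new_model.bert.load_state_dict(old_model.bert.state_dict())

old_emb = old_model.bert.embeddings.word_embeddings.weight
new_emb = new_model.bert.embeddings.word_embeddings.weight
same_values = torch.equal(old_emb, new_emb)
shares_storage = old_emb.data_ptr() == new_emb.data_ptr()

input_ids = torch.randint(0, 100, (1, 5))
with torch.no_grad():
    logits = new_model(input_ids=input_ids).logits
print(same_values, shares_storage, tuple(logits.shape))
```

Here the encoder values match, the two models do not share storage, and the new head produces one logit per label.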

This is now possible (thanks to @sgugger) by passing in an additional argument called ignore_mismatched_sizes, which you can set to True.

If you have an already fine-tuned model with, let’s say 17 labels, and you want to replace the head with one that has 10 outputs, you can do it as follows:

from transformers import BertForTokenClassification

model_name = "vblagoje/bert-english-uncased-finetuned-pos"

model = BertForTokenClassification.from_pretrained(model_name, num_labels=10, ignore_mismatched_sizes=True)

This will print the following warning:

Some weights of BertForTokenClassification were not initialized from the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([17, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([17]) in the checkpoint and torch.Size([10]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
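
If you want to convince yourself that only the head is re-initialised, a quick offline check could look like the sketch below. It saves a tiny, randomly initialised 17-label model to a temporary directory (the BertConfig sizes are assumptions for illustration, not the real checkpoint), reloads it with 10 labels and ignore_mismatched_sizes=True, and compares an encoder weight before and after.

```python
import tempfile

import torch
from transformers import BertConfig, BertForTokenClassification

# Hypothetical tiny model standing in for the fine-tuned 17-label checkpoint.
tiny = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=17,
)
finetuned = BertForTokenClassification(tiny)

with tempfile.TemporaryDirectory() as checkpoint_dir:
    finetuned.save_pretrained(checkpoint_dir)
    # Reload with a different label count; the mismatched head is re-initialised.
    model = BertForTokenClassification.from_pretrained(
        checkpoint_dir, num_labels=10, ignore_mismatched_sizes=True
    )

encoder_kept = torch.equal(
    finetuned.bert.embeddings.word_embeddings.weight,
    model.bert.embeddings.word_embeddings.weight,
)
head_shape = tuple(model.classifier.weight.shape)
print(encoder_kept, head_shape)
```

The encoder weights are identical to the saved ones, while the classifier has the new output size and still needs to be trained.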