Trying to understand XForSequenceClassification heads

I’m interested in 1-sentence and 2-sentence text classification, so I’ve been looking at the classification heads for BERT, GPT2, XLNet, and RoBERTa. I have a few questions:

1. I see that there are dedicated classification classes BertForSequenceClassification, XLNetForSequenceClassification, and RobertaForSequenceClassification. However, there is no XForSequenceClassification class for GPT2. Is there any documentation to help us write our own?

2. When I look at the classification heads for BERT, XLNet, and RoBERTa, the layer structure for producing the logits appears to be different for each one. I would think that the final few layers would be exactly the same.

For example, here is the code for the BERT classification head:

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward( ... ):
        outputs = self.bert( ... )
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

Here is the code for the XLNet classification head:

class XLNetForSequenceClassification(XLNetPreTrainedModel):

    def __init__(self, config):
        self.transformer = XLNetModel(config)
        self.sequence_summary = SequenceSummary(config)
        self.logits_proj = nn.Linear(config.d_model, config.num_labels)

    def forward( ... ):
        transformer_outputs = self.transformer( ... )
        output = transformer_outputs[0]
        output = self.sequence_summary(output)
        logits = self.logits_proj(output)

Here is the code for the RoBERTa classification head:

class RobertaForSequenceClassification(BertPreTrainedModel):
    config_class = RobertaConfig
    base_model_prefix = "roberta"

    def __init__(self, config):
        self.num_labels = config.num_labels

        self.roberta = RobertaModel(config)
        self.classifier = RobertaClassificationHead(config)

    def forward( ... ):
        outputs = self.roberta( ... )
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)

class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

From the above code, producing the output logits involves:

  1. BERT: dropout, linear
  2. XLNet: linear
  3. RoBERTa: dropout, linear, tanh, dropout, linear

Why does each model implement the final layers differently? Are the implementations taken from the original papers? Wouldn’t it be better to make the classification layers exactly the same for each model, so that relative classification performance would reflect the models’ internal architecture rather than the classification layer?
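To make the idea of an identical head concrete, here is a minimal sketch. It assumes each base model returns hidden states of shape (batch, seq_len, hidden_size) as its first output. `UniformClassificationHead` and `pool_index` are hypothetical names, not part of the library, and the random tensor only stands in for a real encoder output.

```python
import torch
from torch import nn


class UniformClassificationHead(nn.Module):
    """Hypothetical shared head: pick one token's hidden state, then dropout + linear."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1, pool_index=0):
        super().__init__()
        # Index of the token whose hidden state summarizes the sequence
        # (0 = first token; -1 = last token).
        self.pool_index = pool_index
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size)
        pooled = hidden_states[:, self.pool_index, :]
        return self.classifier(self.dropout(pooled))


# Stand-in for an encoder's output; a real run would use the base model's hidden states.
hidden_states = torch.randn(4, 16, 768)
head = UniformClassificationHead(hidden_size=768, num_labels=2)
logits = head(hidden_states)
print(logits.shape)
```

Even with a shared head, the pooling position would probably still have to differ per model (for example, XLNet places its classification token at the end of the sequence), so the comparison would not be completely free of model-specific choices.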

Thank you for any help.

Any help would be greatly appreciated.

As you mentioned, this has been taken from the original implementation, at least for BERT. I am not sure about the others because IIRC RoBERTa removed the NSP objective so it was not pretrained with a sentence classification head.

It is worth noting that, again if I recall correctly, BertForSequenceClassification’s head is pretrained, whereas e.g. RobertaForSequenceClassification’s isn’t.

cc @thomwolf Do you remember how classification heads were implemented when the original implementation was not pretrained on such an objective? Also, please correct me if my statement above is incorrect.

Yes, these are the classification heads as provided by the various research teams, which is why they differ from each other.

Bert’s classification head is somehow pretrained on the NSP task, but you should probably still train it if you want to use it for another task anyway.

We don’t provide a classification head for GPT/GPT2 because, to use it as a classification model, you need to add a new token to the vocabulary/model and decide how you want to process your data, which seems (to me) a bigger step than just training a head as with the other, more recent models. This is debatable, I guess.
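One possible arrangement for GPT2, sketched under assumptions rather than taken from the library: append a newly added classification token to each sequence and read the hidden state at that position. `CLS_ID` and `gather_cls_states` are hypothetical, the padding id 0 is made up, and a random tensor stands in for the transformer's hidden states.

```python
import torch
from torch import nn

CLS_ID = 50257  # hypothetical id assigned to a newly added classification token


def gather_cls_states(hidden_states, input_ids, cls_id=CLS_ID):
    """Return the hidden state at the classification token of each row."""
    # Position of the classification token; assumes exactly one per sequence.
    positions = (input_ids == cls_id).float().argmax(dim=1)
    batch_index = torch.arange(input_ids.size(0))
    return hidden_states[batch_index, positions]


# Two padded sequences; 0 is used here as a made-up padding id.
input_ids = torch.tensor([
    [11, 12, 13, CLS_ID, 0, 0],
    [21, 22, 23, 24, 25, CLS_ID],
])
hidden_states = torch.randn(2, 6, 768)  # stand-in for the transformer's output

cls_states = gather_cls_states(hidden_states, input_ids)
classifier = nn.Linear(768, 3)  # freshly initialized, so it still needs training
logits = classifier(cls_states)
print(logits.shape)
```

The token goes at the end because GPT2 attends left to right, so only a position after the text has seen the whole input. Where exactly to put it and how to handle sentence pairs are the data-processing decisions mentioned above.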


When answering this question, I found that BertForSequenceClassification doesn’t actually use the pretrained linear weights for the classification layer. I kinda had expected that it did. They probably would not be useful for most tasks and would need finetuning anyway, but still.

Maybe the XForX models can have a line in their docs stating which of its heads are pretrained and which ones aren’t.

I agree that the documentation should have a statement that the classifier weights aren’t pretrained, but I think that fact is clarified when you download the model. The (verbose) warning says:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
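
The two lists in that warning amount to a set difference between the parameter names stored in the checkpoint and the names the model class expects. A rough sketch of that comparison, using key names from the message above. `compare_keys` is an illustrative helper, not a library function, and the encoder weights are reduced to a single representative entry.

```python
def compare_keys(checkpoint_keys, model_keys):
    """Split parameter names into unused checkpoint weights and newly initialized ones."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    unused = sorted(checkpoint_keys - model_keys)  # in the checkpoint, not in the model
    newly_initialized = sorted(model_keys - checkpoint_keys)  # in the model, not in the checkpoint
    return unused, newly_initialized


shared = ["bert.embeddings.word_embeddings.weight"]  # one representative encoder weight
checkpoint = shared + ["cls.seq_relationship.weight", "cls.seq_relationship.bias"]
model = shared + ["classifier.weight", "classifier.bias"]

unused, newly_initialized = compare_keys(checkpoint, model)
print(unused)
print(newly_initialized)
```

Read this way, the NSP weights (`cls.seq_relationship.*`) land in the unused list and `classifier.*` lands in the newly initialized list, which matches the observation that the classifier layer is not loaded from the pretrained checkpoint.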

Yes, that is a very good warning. However, logging is user-controlled, and it might easily have been disabled, for instance when users set up their own logger. Although the warning is probably useful for most people, a clarification in the documentation would not hurt.
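
To illustrate how easily this happens: a warning emitted through Python's standard `logging` module disappears as soon as the logger's level is raised. A small sketch, where `some_library.modeling` is a hypothetical logger name standing in for the library's own logger and the message text is shortened:

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages so we can see which ones actually got through."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("some_library.modeling")  # hypothetical logger name
handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.setLevel(logging.WARNING)
logger.warning("Some weights were newly initialized")  # recorded

logger.setLevel(logging.ERROR)  # e.g. a user silencing library warnings
logger.warning("Some weights were newly initialized")  # dropped

print(len(handler.messages))
```

Only the first call reaches the handler, so a user who raised the level before loading the model would never see the message.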

I also found this statement in the Transformers introductory page on Summary of the tasks:

All tasks presented here leverage pre-trained checkpoints that were fine-tuned on specific tasks. Loading a checkpoint that was not fine-tuned on a specific task would load only the base transformer layers and not the additional head that is used for the task, initializing the weights of that head randomly.
This would produce random output.
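
A tiny sketch of what "random output" means here, assuming a plain linear head on top of fixed features. The feature tensor is a stand-in for pooled encoder output, and `fresh_head_logits` is a hypothetical helper.

```python
import torch
from torch import nn


def fresh_head_logits(features, num_labels, seed):
    """Build a newly initialized linear head and apply it to the given features."""
    torch.manual_seed(seed)  # the seed determines the random initial weights
    head = nn.Linear(features.size(-1), num_labels)
    with torch.no_grad():
        return head(features)


features = torch.ones(1, 768)  # stand-in for the pooled output of a pretrained encoder

logits_a = fresh_head_logits(features, num_labels=2, seed=0)
logits_b = fresh_head_logits(features, num_labels=2, seed=1)

# Same input, same pretrained features, but the logits depend only on the random init.
print(logits_a)
print(logits_b)
```

The two heads disagree on identical input, which is why predictions from an untrained head carry no information about the task.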

Yes, that is also a good one. The problem that I am having is that this is not always intuitive. Take for instance the BertForSequenceClassification model. I would have expected that the classifier layer was pretrained (because of the NSP task mentioned in the paper), but it isn’t. A small note in the documentation would immediately make it clear to a user when they look it up.