What is the purpose of the additional dense layer in classification heads?

I was looking at the code for RobertaClassificationHead and it adds an additional dense layer, which is not described in the paper for fine-tuning for classification.
I have looked at a few other classification heads in the Transformers library and they also add that additional dense layer.
For example, the classification head for RoBERTa is:

import torch
from torch import nn

class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""
    def __init__(self, config):
        super().__init__()  # required before registering submodules on an nn.Module
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

To match the paper, it should be:

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

What is the purpose of that additional dense + tanh + dropout?

Thank you very much!


Unlike BERT, RoBERTa does not have a pooler layer, since its pretraining objective does not contain a classification task. When doing sentence classification with BERT, your final hidden states go through a BertPooler (which is just dense + tanh), a dropout and a final classification layer (which is a dense layer).

This structure is mimicked for all models on a sentence classification task, which is why for Roberta (which does not have a pooler) you get those two linear layers in the classification head. Hope that makes sense!
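To make the comparison concrete, here is a minimal sketch of the BERT-style path described above: pooler (dense + tanh), then dropout, then a final linear classifier. The class names `BertStylePooler` and `BertStyleClassifier`, the sizes, and the default dropout probability are made up for illustration and are not the library's actual classes. Note that the RoBERTa head quoted earlier also applies dropout before the dense layer, which this sketch does not.

```python
import torch
from torch import nn

class BertStylePooler(nn.Module):
    """Hypothetical stand-in for a BERT-style pooler: dense + tanh on the first token."""
    def __init__(self, hidden_size):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        first_token = hidden_states[:, 0]  # [CLS] position
        return self.activation(self.dense(first_token))

class BertStyleClassifier(nn.Module):
    """Pooler -> dropout -> linear classifier, mirroring the structure described above."""
    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.pooler = BertStylePooler(hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        pooled = self.pooler(hidden_states)
        return self.classifier(self.dropout(pooled))

# Toy input: batch of 2 sequences, 5 tokens each, hidden size 16 (arbitrary values)
hidden_states = torch.randn(2, 5, 16)
logits = BertStyleClassifier(hidden_size=16, num_labels=3)(hidden_states)
```

Counting the layers, this is the same two-linear-layer stack as the RoBERTa head; the only difference is that the first dense + tanh lives in the base model for BERT and in the head for RoBERTa.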


That makes sense! Thank you for the explanation!

I’m curious about the theory behind having two linear final layers, though. Is it just to add another non-linearity (because of the activation function) at the end to train? Has there been any research that confirms that this yields better results than a single final linear layer?


I don’t think there has been lots of research on this. Also, as the docs say, the pooler_output in BERT is not a good summary and should not be used in classification, yet we still do that.
When I have some time, I’d love to experiment a bit and see which kind of classification head works best, but if anyone wants to try different variants and report results here, that would be super helpful!

I’ve read that statement a number of times, indeed, though I have never encountered any issues with it - although I typically use my own configuration. For a sentence classification task I have always followed the results from the original BERT paper: concatenating the hidden states of the last four layers, taking the output from [CLS], and adding a linear layer, activation, and final linear layer on top.
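A rough sketch of the head described above, under some assumptions: `all_hidden_states` is a tuple of per-layer tensors of shape (batch, seq_len, hidden_size), as you get from a Transformers model called with `output_hidden_states=True`; the activation is tanh (the post does not say which one was used); and the class name `ConcatLastFourHead` and the dropout placement are hypothetical choices for illustration.

```python
import torch
from torch import nn

class ConcatLastFourHead(nn.Module):
    """Hypothetical head: concat [CLS] from the last four layers -> linear -> tanh -> linear."""
    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(4 * hidden_size, hidden_size)
        self.activation = nn.Tanh()  # assumption: the original activation is unspecified
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, all_hidden_states):
        # Take the first-token ([CLS]) vector from each of the last four layers
        cls_per_layer = [h[:, 0, :] for h in all_hidden_states[-4:]]
        x = torch.cat(cls_per_layer, dim=-1)  # (batch, 4 * hidden_size)
        x = self.dropout(self.activation(self.dense(x)))
        return self.out_proj(x)

# Toy stand-in for 13 per-layer outputs (embeddings + 12 layers), hidden size 16
all_hidden_states = tuple(torch.randn(2, 5, 16) for _ in range(13))
concat_logits = ConcatLastFourHead(hidden_size=16, num_labels=3)(all_hidden_states)
```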

My guess is, as always, that it depends on your specific use case, dataset, optimizer.

Perhaps the warning is that the CLS item is not a good semantic representation. So if you want to use that final hidden state as “semantic features” in another system, you may want to opt for the mean over the tokens or something like that. But if you are going straight into the classification problem after the transformer layers, you can probably resort to [CLS]? Not sure.
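For reference, "the mean over the tokens" could look roughly like the sketch below. It assumes you have the last hidden state of shape (batch, seq_len, hidden_size) and an attention mask with 1 for real tokens and 0 for padding; the function name `mean_pool` is just illustrative.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid division by zero
    return summed / counts

# Toy example: one sequence with two real tokens and one padding token
toy_hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(toy_hidden, toy_mask)
```

The padded position is masked out, so only the two real tokens contribute to the average.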


@BramVanroy @sgugger,

I studied this a while ago and can’t remember all the details. I think it has something to do with BERT being trained at the token level and its loss objective. Here is a paper about Sentence-BERT that is a good reference. Also, here is the Sentence Transformers GitHub by the author. Just reading through the issues on there was a wealth of information on the topic.

@BramVanroy In the BERT paper, the authors use concatenation of the last 4 layers only for the feature-based approach (BERT layers are frozen). For the fine-tuning approach, they just use the last CLS representation. (Section 5.3 of BERT paper)

I’m wondering if you tried fine-tuning with only the last layer versus concatenation of the last 4 layers, and saw any improvement?

I’m also curious about the effects of having a single Linear or 2 Linears with a non-linearity on the classification scores.

I tried both concatenating and using only the last layer’s hidden state, frozen and not frozen, and in all cases (in my task back then), the concatenation worked best. So I use it as my go-to method now, although I cannot be certain that it works equally well for other tasks or models.


Thanks for the answer! I have to try it as well then ^^