What is the purpose of the additional dense layer in classification heads?

edugp · July 29, 2020, 6:21pm

I was looking at the code for RobertaClassificationHead and it adds an additional dense layer, which is not described in the paper for fine-tuning for classification.
I have looked at a few other classification heads in the Transformers library and they also add that additional dense layer.
For example, the classification head for RoBERTa is:

import torch
from torch import nn

class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

To match the paper, it should be:

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

What is the purpose of that additional dense + tanh + dropout?

Thank you very much!

sgugger · July 29, 2020, 8:45pm

RoBERTa does not have a pooler layer (as BERT does, for instance) since its pretraining objective does not contain a classification task. When doing sentence classification with BERT, your final hidden states go through a BertPooler (which is just dense + tanh), a dropout and a final classification layer (which is a dense layer).

This structure is mimicked for all models on a sentence classification task, which is why for RoBERTa (which does not have a pooler) you get those two linear layers in the classification head. Hope that makes sense!
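To make the structure described above concrete, here is a minimal sketch in plain PyTorch of the BERT-style path: pooler (dense + tanh), then dropout, then the final linear classifier. The class name `PoolerThenClassifier` and the sizes used below are made up for illustration; this is not the library's code. Note that the RoBERTa head quoted in the question additionally applies dropout before the dense layer.

```python
import torch
from torch import nn


class PoolerThenClassifier(nn.Module):
    """Illustrative only: pooler (dense + tanh) -> dropout -> linear classifier."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.pooler_dense = nn.Linear(hidden_size, hidden_size)  # the "pooler" dense layer
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)  # final classification layer

    def forward(self, hidden_states):
        first_token = hidden_states[:, 0, :]  # [CLS] / <s> position
        pooled = torch.tanh(self.pooler_dense(first_token))
        return self.classifier(self.dropout(pooled))


torch.manual_seed(0)
hidden_states = torch.randn(2, 5, 16)  # (batch, seq_len, hidden_size), random stand-in
head = PoolerThenClassifier(hidden_size=16, num_labels=3)
head.eval()  # disable dropout for a deterministic forward pass
logits = head(hidden_states)
print(logits.shape)  # torch.Size([2, 3])
```

Counting the linear layers gives two, the same as in the RoBERTa head: the first plays the role of the pooler, the second is the classifier proper.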

edugp · July 29, 2020, 9:14pm

That makes sense! Thank you for the explanation!

BramVanroy · July 30, 2020, 7:25am

I’m curious about the theory behind having two linear final layers, though. Is it just to add another non-linearity (because of the activation function) at the end to train? Has there been any research that confirms that this yields better results than a single final linear layer?

sgugger · July 30, 2020, 1:43pm

I don’t think there has been lots of research on this. Also, as the docs say, the pooler_output in BERT is not a good summary and should not be used in classification, yet we still do that.
When I have some time, I’d love to experiment a bit and see which kind of classification head works best, but if anyone wants to try different variants and report results here, that would be super helpful!

BramVanroy · July 30, 2020, 1:52pm

I’ve read that statement a number of times, indeed, though I have never encountered any issues with it, although I typically use my own configuration. For a sentence classification task I have always followed the results from the original BERT paper: concatenating the hidden states of the last four layers, taking the output from [CLS], and adding a linear layer, activation, and final linear layer on top.
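A rough sketch of the configuration described above, under some assumptions: the post does not say which activation or intermediate size is used, so tanh and `hidden_size` are picked here to mirror the RoBERTa head, and the class name `ConcatLastFourHead` is hypothetical. The per-layer hidden states are random stand-ins; with Transformers they could be obtained by requesting `output_hidden_states=True`.

```python
import torch
from torch import nn


class ConcatLastFourHead(nn.Module):
    """Illustrative head: concat [CLS] from the last four layers -> linear -> tanh -> linear."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(4 * hidden_size, hidden_size)  # assumed intermediate size
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, all_hidden_states):
        # all_hidden_states: sequence of (batch, seq_len, hidden_size) tensors, one per layer
        cls_per_layer = [h[:, 0, :] for h in all_hidden_states[-4:]]
        x = torch.cat(cls_per_layer, dim=-1)  # (batch, 4 * hidden_size)
        x = torch.tanh(self.dense(x))  # assumed activation
        x = self.dropout(x)
        return self.out_proj(x)


torch.manual_seed(0)
# 13 stand-in tensors, e.g. embeddings + 12 layers for a base-sized model
all_hidden_states = tuple(torch.randn(2, 5, 16) for _ in range(13))
concat_head = ConcatLastFourHead(hidden_size=16, num_labels=3)
concat_head.eval()
concat_logits = concat_head(all_hidden_states)
print(concat_logits.shape)  # torch.Size([2, 3])
```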

My guess is, as always, that it depends on your specific use case, dataset, optimizer.

Perhaps the warning is that the CLS item is not a good semantic representation. So if you want to use that final hidden state as “semantic features” in another system, you may want to opt for the mean over the tokens or something like that. But if you are going straight into the classification problem after the transformer layers, you can probably resort to [CLS]? Not sure.
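For reference, “the mean over the tokens” could look something like the sketch below. It assumes a padding mask with 1 for real tokens and 0 for padding, so that padded positions do not affect the average; the function name `mean_pool` is made up for this example.

```python
import torch


def mean_pool(hidden_states, attention_mask):
    """Average token vectors over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# One sequence of three tokens; the last one is padding and must be ignored.
hidden_states = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden_states, attention_mask)
print(pooled)  # tensor([[2., 3.]])
```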

aclifton314 · July 30, 2020, 9:22pm

@BramVanroy @sgugger,

I studied this a while ago and can’t remember all the details. I think it has something to do with BERT being trained at the token level and its loss objective. Here is a paper about Sentence-BERT that is a good reference. Also, here is the Sentence Transformers GitHub by the author. Just reading through the issues on there was a wealth of information on the topic.

astariul · July 31, 2020, 5:07am

@BramVanroy In the BERT paper, the authors use concatenation of the last 4 layers only for the feature-based approach (BERT layers are frozen). For the fine-tuning approach, they just use the last layer’s CLS representation. (Section 5.3 of the BERT paper)

I’m wondering if you tried fine-tuning with only the last layer versus concatenation of the last 4 layers, and saw any improvement?

I’m also curious about the effects of having a single Linear or 2 Linears with a non-linearity on the classification scores.

BramVanroy · July 31, 2020, 8:42am

I tried both concatenating and using only the last layer’s hidden state, frozen and not frozen, and in all cases (in my task back then), the concatenation worked best. So I use it as my go-to method now, although I cannot be certain that it works equally well for other tasks or models.
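In PyTorch terms, “frozen” versus “not frozen” comes down to whether the encoder parameters receive gradients. Here is a small sketch of the frozen (feature-based) setting, using a tiny stand-in encoder in place of a real pretrained model; the helper `set_frozen` and all sizes are hypothetical.

```python
import torch
from torch import nn

# Stand-in for a pretrained encoder; a real setup would load a transformer here.
encoder = nn.Sequential(nn.Linear(16, 16), nn.Tanh())
clf_head = nn.Linear(16, 3)


def set_frozen(module, frozen):
    """Turn gradient tracking off (frozen) or on (fine-tuning) for a module."""
    for param in module.parameters():
        param.requires_grad_(not frozen)


set_frozen(encoder, True)  # feature-based approach: only the head is trained

# Hand only the trainable parameters to the optimizer.
all_params = list(encoder.parameters()) + list(clf_head.parameters())
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

features = torch.randn(4, 16)
loss = clf_head(encoder(features)).sum()  # dummy loss, just to run backward
loss.backward()

print(encoder[0].weight.grad is None)  # True: frozen encoder gets no gradient
print(clf_head.weight.grad is not None)  # True: the head is being trained
```

Calling `set_frozen(encoder, False)` before building the optimizer would give the fine-tuning setting instead.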

astariul · August 3, 2020, 2:58am

Thanks for the answer! I have to try it as well then ^^
