Current best practice for final linear classifier layer(s)?

I’m using BERT for text classification (sentiment analysis or NLI). I pass the 768-D pooled vector through linear layers to get to a final N-way softmax. I was wondering what the current best practice is for this final block of linear layers.

I see in the implementation of BertForSequenceClassification that the 768-D pooled output is passed through a Dropout and a Linear layer.

pooled_output = outputs[1]                   # pooled [CLS] representation (768-D)
pooled_output = self.dropout(pooled_output)  # Dropout
logits = self.classifier(pooled_output)      # Linear(768, num_labels)

Is this the current best practice? What about adding more linear layers, dropout, ReLU, batch norm, etc.?

I am using this classifier architecture:

pooled_output = outputs[1] # 768-D
# --- start block ---
Linear (out=1000-D)
ReLU
BatchNorm
Dropout (0.25)
# --- end block ---
Linear (out=N) # Final N-way softmax

I can repeat the classifier block as many times as I want, with any intermediate dimensionality (a rough PyTorch sketch of what I mean is below). I’m worried that my knowledge of how to use ReLU, batch norm, and dropout may be outdated.
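For reference, this is roughly how I would write that head in PyTorch; the class name, default values, and num_blocks parameter are just placeholders for illustration:

import torch.nn as nn

class ClassifierHead(nn.Module):
    # Repeated Linear -> ReLU -> BatchNorm -> Dropout blocks, then a final Linear to N outputs
    def __init__(self, in_dim=768, hidden_dim=1000, num_labels=3, num_blocks=1, p_drop=0.25):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(num_blocks):
            layers += [
                nn.Linear(dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
                nn.Dropout(p_drop),
            ]
            dim = hidden_dim
        layers.append(nn.Linear(dim, num_labels))  # final N-way logits; softmax applied in the loss
        self.net = nn.Sequential(*layers)

    def forward(self, pooled_output):  # pooled_output: (batch_size, 768)
        return self.net(pooled_output)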

Any help would be appreciated.


There is already one hidden layer between the final hidden state and the pooled output you see, so the one in SequenceClassificationHead is the second one. Usually, two hidden layers are sufficient for a classification head (this holds for vision as well as text), but you can certainly try more and see if you get better results.
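To make that concrete, the pooled output you index with outputs[1] is the [CLS] hidden state already passed through a Linear + Tanh layer. Simplified from the transformers source, the pooler looks roughly like this:

import torch.nn as nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)  # 768 -> 768
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # take the hidden state of the first token ([CLS]) and transform it
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output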

Thank you for the reply.

Do you have an opinion on using batch normalization in the classification head? I don’t see it used anywhere in HuggingFace’s code for the BERT models, and I’ve been treating the HuggingFace code as a best-practice reference for lack of other information.

Again, try it and see if it gives you better results. The head is the one used in the BERT paper for fine-tuning, so it’s there for reproducibility. Note that transformer models use LayerNorm internally, so that one might work well for a classification head too.
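If you want to experiment with that, a minimal sketch would just swap nn.BatchNorm1d for nn.LayerNorm in your block (keeping your 1000-D hidden size and 0.25 dropout; num_labels stands in for N):

import torch.nn as nn

num_labels = 3  # N, e.g. 3 classes for NLI
head = nn.Sequential(
    nn.Linear(768, 1000),
    nn.ReLU(),
    nn.LayerNorm(1000),   # LayerNorm over the feature dimension, instead of BatchNorm1d
    nn.Dropout(0.25),
    nn.Linear(1000, num_labels),  # final N-way logits
)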