Current best practice for final linear classifier layer(s)?

I’m using BERT for text classification (sentiment analysis or NLI). I pass the 768-D pooled vector through linear layers to get to a final N-way softmax. I was wondering what the current best practice is for this final block of linear layers.

I see in the implementation of BertForSequenceClassification that the 768-D pooled output is passed through a Dropout and a Linear layer.

pooled_output = outputs[1]                   # pooled [CLS] representation (768-D)
pooled_output = self.dropout(pooled_output)  # Dropout
logits = self.classifier(pooled_output)      # Linear(768, num_labels)

Is this the current best practice? What about adding more linear layers, dropout, ReLU, batch norm, etc.?

I am using this classifier architecture:

pooled_output = outputs[1] # 768-D
# --- start block ---
Linear (out=1000-D)
ReLU
BatchNorm
Dropout (0.25)
# --- end block ---
Linear (out=N) # Final N-way softmax

I can repeat the classifier block as many times as I want, with any intermediate dimensionality (a rough PyTorch sketch of what I mean is below). I’m worried that my knowledge of how to use ReLU, batch norm, and dropout may be outdated.
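For reference, this is roughly how I would write that head in PyTorch; the class name, default values, and num_blocks parameter are just placeholders for illustration:

import torch.nn as nn

class ClassifierHead(nn.Module):
    # Repeated Linear -> ReLU -> BatchNorm -> Dropout blocks, then a final Linear to N outputs
    def __init__(self, in_dim=768, hidden_dim=1000, num_labels=3, num_blocks=1, p_drop=0.25):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(num_blocks):
            layers += [
                nn.Linear(dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
                nn.Dropout(p_drop),
            ]
            dim = hidden_dim
        layers.append(nn.Linear(dim, num_labels))  # final N-way logits; softmax applied in the loss
        self.net = nn.Sequential(*layers)

    def forward(self, pooled_output):  # pooled_output: (batch_size, 768)
        return self.net(pooled_output)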

Any help would be appreciated.


There is already one hidden layer between the final hidden state and the pooled output you see, so the one in SequenceClassificationHead is the second one. Usually, two hidden layers are sufficient for a classification head (this holds for vision as well as text), but you can certainly try more and see if you get better results.
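To make that concrete, the pooled output you index with outputs[1] is the [CLS] hidden state already passed through a Linear + Tanh layer. Simplified from the transformers source, the pooler looks roughly like this:

import torch.nn as nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)  # 768 -> 768
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # take the hidden state of the first token ([CLS]) and transform it
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output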

Thank you for the reply.

Do you have an opinion on using batch normalization in the classification head? I don’t see it used anywhere in HuggingFace’s code for the BERT models, and I’ve been treating the HuggingFace code as a best-practice reference for lack of other information.

Again, try it and see if it gives you better results. The head is the one used in the BERT paper for fine-tuning, so it’s there for reproducibility. Note that transformer models use LayerNorm internally, so that one might work well for a classification head too.
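If you want to experiment with that, a minimal sketch would just swap nn.BatchNorm1d for nn.LayerNorm in your block (keeping your 1000-D hidden size and 0.25 dropout; num_labels stands in for N):

import torch.nn as nn

num_labels = 3  # N, e.g. 3 classes for NLI
head = nn.Sequential(
    nn.Linear(768, 1000),
    nn.ReLU(),
    nn.LayerNorm(1000),   # LayerNorm over the feature dimension, instead of BatchNorm1d
    nn.Dropout(0.25),
    nn.Linear(1000, num_labels),  # final N-way logits
)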