I’m using BERT for text classification (sentiment analysis or NLI). I pass the 768-D pooled vector through linear layers to reach a final N-way softmax. What is the current best practice for this final block of linear layers?
I see in the implementation of `BertForSequenceClassification` that the 768-D pooled output is passed through a `Dropout` and a `Linear` layer:
```python
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
```
Is this still the best practice? What about adding more linear layers, dropout, ReLU, batchnorm, etc.?
I am using this classifier architecture:
```
pooled_output = outputs  # 768-D
# --- start block (repeatable) ---
Linear (out=1000-D)
ReLU
BatchNorm
Dropout (0.25)
# --- end block ---
Linear (out=N)  # final N-way softmax
```
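For concreteness, here is a minimal PyTorch sketch of that repeated block as a standalone head module. The class name, dimensions, and defaults are illustrative assumptions, not code from my project or from Hugging Face:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Hypothetical sketch of the Linear -> ReLU -> BatchNorm -> Dropout
    block described above, repeated n_blocks times before the final
    N-way linear layer. Names and dims are illustrative."""
    def __init__(self, in_dim=768, hidden_dim=1000, n_classes=3,
                 n_blocks=1, dropout=0.25):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(n_blocks):
            layers += [
                nn.Linear(dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
                nn.Dropout(dropout),
            ]
            dim = hidden_dim
        # Final projection; softmax is applied by the loss (CrossEntropyLoss)
        layers.append(nn.Linear(dim, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, pooled_output):
        return self.net(pooled_output)

head = ClassifierHead(n_blocks=2)
logits = head(torch.randn(4, 768))  # batch of 4 pooled outputs
print(logits.shape)  # torch.Size([4, 3])
```

Note the final layer emits raw logits rather than softmax probabilities, since `nn.CrossEntropyLoss` expects logits.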
I can repeat the classifier block as many times as I want, with any intermediate dimensionality. I’m worried that my knowledge of how to combine ReLU, batchnorm, and dropout may be outdated.
Any help would be appreciated.