I’m using BERT for text classification (sentiment analysis or NLI). I pass the 768-D pooled vector through linear layers to reach a final N-way softmax. What is the current best practice for this final block of linear layers?
I see in the implementation of `BertForSequenceClassification` that the 768-D pooled output is passed through a `Dropout` and a `Linear` layer:
```python
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
```
Is this still the best practice? What about adding more linear layers, dropout, ReLU, batchnorm, etc.?
I am using this classifier architecture:
```
pooled_output = outputs  # 768-D
# --- start block (repeatable) ---
Linear (out=1000-D)
ReLU
BatchNorm
Dropout (0.25)
# --- end block ---
Linear (out=N)  # final N-way softmax
```
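For concreteness, here is a minimal PyTorch sketch of that repeated block as a standalone head module. The class name, dimensions, and defaults are illustrative assumptions, not code from my project or from Hugging Face:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Hypothetical sketch of the Linear -> ReLU -> BatchNorm -> Dropout
    block described above, repeated n_blocks times before the final
    N-way linear layer. Names and dims are illustrative."""
    def __init__(self, in_dim=768, hidden_dim=1000, n_classes=3,
                 n_blocks=1, dropout=0.25):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(n_blocks):
            layers += [
                nn.Linear(dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
                nn.Dropout(dropout),
            ]
            dim = hidden_dim
        # Final projection; softmax is applied by the loss (CrossEntropyLoss)
        layers.append(nn.Linear(dim, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, pooled_output):
        return self.net(pooled_output)

head = ClassifierHead(n_blocks=2)
logits = head(torch.randn(4, 768))  # batch of 4 pooled outputs
print(logits.shape)  # torch.Size([4, 3])
```

Note the final layer emits raw logits rather than softmax probabilities, since `nn.CrossEntropyLoss` expects logits.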
I can repeat the classifier block as many times as I want, with any intermediate dimensionality. I’m worried that my knowledge of how to combine ReLU, batchnorm, and dropout may be outdated.
Any help would be appreciated.