Hi,
I have been using BertForSequenceClassification and DistilBertForSequenceClassification recently and I have noticed that they have different classification heads.
BertForSequenceClassification applies dropout followed by a single linear layer, whereas DistilBertForSequenceClassification applies a linear layer, a ReLU, dropout, and then a second linear layer.
Is there a particular reason for this?
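For reference, here is a minimal sketch of how I compared the two heads. It assumes the `transformers` library (with PyTorch) is installed. The tiny config sizes and the `head_modules` helper are my own, purely for illustration; the models are randomly initialised, so nothing is downloaded.

```python
from transformers import (
    BertConfig,
    BertForSequenceClassification,
    DistilBertConfig,
    DistilBertForSequenceClassification,
)


def head_modules(model):
    """Return {name: class name} for top-level submodules other than the base encoder.

    Hypothetical helper: it relies on `base_model_prefix`, which names the
    attribute holding the base model ("bert" / "distilbert").
    """
    return {
        name: type(module).__name__
        for name, module in model.named_children()
        if name != model.base_model_prefix
    }


# Tiny configs (arbitrary sizes, chosen only to keep the models small).
bert = BertForSequenceClassification(
    BertConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=1,
        num_attention_heads=2,
        intermediate_size=64,
    )
)
distil = DistilBertForSequenceClassification(
    DistilBertConfig(vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64)
)

bert_head = head_modules(bert)
distil_head = head_modules(distil)

print("BERT head:      ", bert_head)
print("DistilBERT head:", distil_head)
```

This only lists the modules that sit on top of the base encoder. It does not show what is inside the base models themselves; for example, the BERT base model also contains a pooler (a linear layer with tanh) whose output feeds the head, as far as I can tell.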
Thanks in advance!