Classification Heads in BERT and DistilBERT for Sequence Classification

Hi,

I have been using BertForSequenceClassification and DistilBertForSequenceClassification recently, and I noticed that their classification heads differ.

BertForSequenceClassification has a dropout layer followed by a single linear layer (classifier), whereas DistilBertForSequenceClassification has two linear layers (pre_classifier and classifier) with a dropout layer between them.
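
For reference, this is roughly how the two heads can be compared. It is a minimal sketch that assumes the Hugging Face transformers library (with PyTorch) is installed. The tiny config values are arbitrary and only there so that nothing needs to be downloaded, and head_modules is a helper name I made up for illustration.

```python
from transformers import (
    BertConfig,
    BertForSequenceClassification,
    DistilBertConfig,
    DistilBertForSequenceClassification,
)


def head_modules(model):
    """Map name -> class name for every top-level submodule except the base encoder.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    return {
        name: type(module).__name__
        for name, module in model.named_children()
        if name != model.base_model_prefix
    }


# Tiny, randomly initialised models (arbitrary sizes, no pretrained weights).
bert = BertForSequenceClassification(
    BertConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=1,
        num_attention_heads=2,
        intermediate_size=64,
        num_labels=2,
    )
)
distil = DistilBertForSequenceClassification(
    DistilBertConfig(
        vocab_size=100,
        dim=32,
        n_layers=1,
        n_heads=2,
        hidden_dim=64,
        num_labels=2,
    )
)

bert_head = head_modules(bert)
distil_head = head_modules(distil)

print("BERT head:      ", bert_head)
print("DistilBERT head:", distil_head)
```

This only lists the modules registered on each model; activations applied inside forward() will not show up here.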

Is there a particular reason for this?

Thanks in advance!