Classification Heads in BERT and DistilBERT for Sequence Classification

Hi,

I have been using BertForSequenceClassification and DistilBertForSequenceClassification recently and I have noticed that they have different classification heads.

BertForSequenceClassification has a dropout layer and a linear layer, whereas DistilBertForSequenceClassification has two linear layers and a dropout layer.

Is there a particular reason for this?

Thanks in advance!

All in all, they end up with the same head: BertForSequenceClassification has a dropout layer and a linear layer, but it operates on the pooler output, which has already passed through a linear layer (followed by a Tanh) inside BertModel.

DistilBertModel, however, has no pooler output, so the first linear layer in its head (the pre_classifier) is there to replicate it.
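To make the comparison concrete, here is a minimal sketch of the two heads in plain PyTorch (simplified: real hidden sizes, dropout probabilities, and the [CLS] extraction are taken from each model's config; the 0.1 dropout and 768 hidden size below are illustrative assumptions):

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2

# BertForSequenceClassification (sketch): the "extra" linear layer lives
# inside BertModel as the pooler (Linear + Tanh over the [CLS] hidden state)
bert_pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
bert_head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_labels))

# DistilBertForSequenceClassification (sketch): DistilBertModel has no pooler,
# so the head adds a pre_classifier linear layer to play the same role
# (note it uses ReLU rather than Tanh)
distilbert_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),  # pre_classifier
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(hidden_size, num_labels),   # classifier
)

cls_hidden = torch.randn(1, hidden_size)  # stand-in for the [CLS] hidden state
bert_logits = bert_head(bert_pooler(cls_hidden))
distil_logits = distilbert_head(cls_hidden)
print(bert_logits.shape, distil_logits.shape)  # both are (1, num_labels)
```

So in both cases the [CLS] representation goes through one hidden-size linear layer with a nonlinearity, then dropout, then the final projection to the labels; the only structural difference is which module that first linear layer lives in.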


Thank you, that makes sense!