Classification Heads in BERT and DistilBERT for Sequence Classification

valkyrie · May 12, 2021, 4:16pm

Hi,

I have been using BertForSequenceClassification and DistilBertForSequenceClassification recently and I have noticed that they have different classification heads.

BertForSequenceClassification applies a dropout layer followed by a single linear layer, whereas DistilBertForSequenceClassification has two linear layers (`pre_classifier` and `classifier`) with a ReLU and a dropout layer between them.
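
To make the comparison concrete, here is a minimal PyTorch sketch of the two heads as I understand them. The class names `BertStyleHead` and `DistilBertStyleHead` and the helper `count_linear` are made up for illustration and are not part of transformers. The dropout defaults (0.1 and 0.2) are assumed from the default configs. Note that in BERT the head receives the pooled output, which has already passed through a linear layer and tanh inside the base model's pooler, while DistilBERT has no pooler and the head takes the first token's hidden state directly.

```python
import torch
from torch import nn


class BertStyleHead(nn.Module):
    """Illustrative stand-in for the BERT head: dropout, then one linear layer."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # assumed default rate
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
        # pooled_output: (batch, hidden_size), already pooled by the base model
        return self.classifier(self.dropout(pooled_output))


class DistilBertStyleHead(nn.Module):
    """Illustrative stand-in for the DistilBERT head: linear, ReLU, dropout, linear."""

    def __init__(self, dim: int, num_labels: int, dropout: float = 0.2):
        super().__init__()
        self.pre_classifier = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)  # assumed default rate
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, first_token_state: torch.Tensor) -> torch.Tensor:
        # first_token_state: (batch, dim), hidden state at position 0
        x = torch.relu(self.pre_classifier(first_token_state))
        return self.classifier(self.dropout(x))


def count_linear(module: nn.Module) -> int:
    """Count the nn.Linear submodules in a head."""
    return sum(isinstance(m, nn.Linear) for m in module.modules())
```

Both heads map a `(batch, hidden)` tensor to `(batch, num_labels)` logits; the only structural difference in this sketch is the extra `pre_classifier` projection and ReLU in the DistilBERT-style head.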

Is there a particular reason for this?

Thanks in advance!
