Classification Heads in BERT and DistilBERT for Sequence Classification

Hi,

I have been using BertForSequenceClassification and DistilBertForSequenceClassification recently and I have noticed that they have different classification heads.

BertForSequenceClassification has a dropout layer and a linear layer, whereas DistilBertForSequenceClassification has two linear layers and a dropout layer.

Is there a particular reason for this?

Thanks in advance!

All in all, they end up with the same head: BertForSequenceClassification has a dropout layer and a linear layer, but it operates on the pooler output, which has already passed through a linear layer (followed by a Tanh) inside BertModel.

DistilBertModel, however, has no pooler output, so the first linear layer in its head (the pre_classifier) is there to replicate it.
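To make the comparison concrete, here is a minimal sketch of the two heads in plain PyTorch (simplified: real hidden sizes, dropout probabilities, and the [CLS] extraction are taken from each model's config; the 0.1 dropout and 768 hidden size below are illustrative assumptions):

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2

# BertForSequenceClassification (sketch): the "extra" linear layer lives
# inside BertModel as the pooler (Linear + Tanh over the [CLS] hidden state)
bert_pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
bert_head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_labels))

# DistilBertForSequenceClassification (sketch): DistilBertModel has no pooler,
# so the head adds a pre_classifier linear layer to play the same role
# (note it uses ReLU rather than Tanh)
distilbert_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),  # pre_classifier
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(hidden_size, num_labels),   # classifier
)

cls_hidden = torch.randn(1, hidden_size)  # stand-in for the [CLS] hidden state
bert_logits = bert_head(bert_pooler(cls_hidden))
distil_logits = distilbert_head(cls_hidden)
print(bert_logits.shape, distil_logits.shape)  # both are (1, num_labels)
```

So in both cases the [CLS] representation goes through one hidden-size linear layer with a nonlinearity, then dropout, then the final projection to the labels; the only structural difference is which module that first linear layer lives in.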


Thank you, that makes sense!