Implementation difference between BertForSequenceClassification and RobertaForSequenceClassification?

Hi everyone, I was looking into the source code for BertForSequenceClassification and RobertaForSequenceClassification. I found that BertForSequenceClassification uses the pooled_output produced by BertPooler to feed the classification head, while RobertaForSequenceClassification skips the pooler and works on the sequence_output directly, reimplementing the same first-token + dense + tanh pooling inside its own classification head, but with an additional dropout layer.
If you unravel and compare the code, you can see the extra dropout layer:
[Image: bertcompare — side-by-side comparison of the two classification heads]
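For reference, here is a condensed sketch of the two paths, simplified from the transformers source (constructor arguments and comments are mine; in the real code these values come from the model config):

```python
import torch
import torch.nn as nn


class BertPooler(nn.Module):
    """BERT pools by passing the [CLS] hidden state through dense + tanh."""

    def __init__(self, hidden_size):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        first_token = hidden_states[:, 0]          # [CLS] token
        return self.activation(self.dense(first_token))


# BertForSequenceClassification then does, in essence:
#   pooled_output = self.dropout(pooled_output)   # a single dropout
#   logits = self.classifier(pooled_output)       # Linear(hidden, num_labels)


class RobertaClassificationHead(nn.Module):
    """RoBERTa reimplements the pooling inside the head, with an extra
    dropout *before* the dense layer — the difference in question."""

    def __init__(self, hidden_size, num_labels, dropout_prob):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        x = features[:, 0, :]        # take <s> token (equivalent to [CLS])
        x = self.dropout(x)          # <-- the extra dropout BERT doesn't have
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)          # matches BERT's dropout before classifier
        return self.out_proj(x)
```

So both heads classify from the first token's hidden state through dense + tanh + dropout + linear; RoBERTa just inlines that pipeline and adds one more dropout at the start.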

Why not simply use a RobertaPooler / the pooled_output like BERT does? What is the significance of this extra dropout layer?
