Hi everyone, I was looking into the source code for BertForSequenceClassification and RobertaForSequenceClassification and noticed a difference in how the classification head is built. BertForSequenceClassification uses the pooled_output produced by BertPooler to feed the head. RobertaForSequenceClassification, on the other hand, takes the sequence_output instead of the pooled_output and reimplements essentially the same pooling inside its classification head, but with an additional dropout layer.
If you unravel and compare the two code paths, you can see the extra dropout layer in the RoBERTa head.
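
To make the comparison concrete, here is a rough side-by-side sketch of what the two paths effectively compute. This is paraphrased from the transformers source rather than copied verbatim, and hidden_size, num_labels, and dropout_prob are hard-coded here just for illustration:

```python
import torch
import torch.nn as nn

hidden_size, num_labels, dropout_prob = 768, 2, 0.1

# BERT path: BertPooler applies dense + tanh to the [CLS] hidden state,
# then BertForSequenceClassification applies dropout + a linear classifier.
class BertStyleHead(nn.Module):
    def __init__(self):
        super().__init__()
        # BertPooler
        self.pooler_dense = nn.Linear(hidden_size, hidden_size)
        self.pooler_activation = nn.Tanh()
        # BertForSequenceClassification
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):  # (batch, seq_len, hidden)
        pooled = self.pooler_activation(self.pooler_dense(sequence_output[:, 0]))
        return self.classifier(self.dropout(pooled))

# RoBERTa path: RobertaClassificationHead works on sequence_output directly
# and applies an extra dropout *before* the dense layer.
class RobertaStyleHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):  # (batch, seq_len, hidden)
        x = sequence_output[:, 0]        # <s> token (RoBERTa's equivalent of [CLS])
        x = self.dropout(x)              # <-- the extra dropout layer
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)

# Quick shape check
seq_out = torch.randn(4, 16, hidden_size)
print(BertStyleHead()(seq_out).shape, RobertaStyleHead()(seq_out).shape)
```

So both heads end up doing dense + tanh + dropout + linear on the first token's hidden state; the RoBERTa version just adds one more dropout before the dense layer.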

Why not simply use the RobertaPooler / pooled_output like BERT does? What is the significance of this extra dropout layer?