Hi everyone, I was looking into the source code for BertForSequenceClassification and RobertaForSequenceClassification. BertForSequenceClassification feeds its classification head with the pooled_output produced by BertPooler. RobertaForSequenceClassification, on the other hand, takes the sequence_output instead of the pooled_output and reimplements essentially the same pooling inside its own classification head, but with an additional dropout layer.
If I unravel and compare the two code paths, the RoBERTa head clearly contains an extra dropout layer.
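To make the comparison concrete, here is a minimal sketch of the two paths as I read them from the transformers source. The class names BertStyleHead and RobertaStyleHead are my own simplification, and the real forward methods take more arguments, but the ordering of dropout/dense/tanh is what I am asking about:

```python
import torch
from torch import nn

class BertStyleHead(nn.Module):
    """BERT path: BertPooler (dense + tanh on [CLS]) -> dropout -> classifier."""
    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.pooler_dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):
        cls_token = sequence_output[:, 0]                 # [CLS] hidden state
        pooled = torch.tanh(self.pooler_dense(cls_token)) # this is the pooled_output
        return self.classifier(self.dropout(pooled))      # single dropout


class RobertaStyleHead(nn.Module):
    """RoBERTa path: sequence_output -> dropout -> dense + tanh -> dropout -> out_proj."""
    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):
        x = sequence_output[:, 0]          # <s> token (RoBERTa's equivalent of [CLS])
        x = self.dropout(x)                # extra dropout before the dense layer
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)


# quick shape check with random hidden states
hidden = torch.randn(2, 8, 768)                    # (batch, seq_len, hidden_size)
print(BertStyleHead(768, 3)(hidden).shape)         # torch.Size([2, 3])
print(RobertaStyleHead(768, 3)(hidden).shape)      # torch.Size([2, 3])
```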
Why not simply use the RobertaPooler / pooled_output like BERT does? What is the significance of this extra dropout layer?