Hi everyone, I was looking into the source code for `RobertaForSequenceClassification` and noticed a difference from `BertForSequenceClassification`: BERT's classification head takes the `pooled_output` produced by `BertPooler`, whereas RoBERTa takes the raw `sequence_output` and reimplements the same pooling inside `RobertaClassificationHead`, but with an additional dropout layer. If you unravel and compare the two code paths, that extra dropout (applied before the dense layer) is the only real difference.
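To make the comparison concrete, here is a minimal side-by-side sketch of the two paths, paraphrased from `modeling_bert.py` and `modeling_roberta.py`. The class names below (`BertStylePooling`, `RobertaStyleHead`) are simplified stand-ins of mine, not the library's, and exact details may vary across `transformers` versions:

```python
import torch
from torch import nn


class BertStylePooling(nn.Module):
    """BERT path: BertPooler (dense + tanh on [CLS]) -> dropout -> classifier."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # from BertPooler
        self.activation = nn.Tanh()                       # from BertPooler
        self.dropout = nn.Dropout(dropout_prob)           # single dropout
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):
        x = sequence_output[:, 0]            # [CLS] hidden state
        x = self.activation(self.dense(x))   # this is the pooled_output
        x = self.dropout(x)
        return self.classifier(x)


class RobertaStyleHead(nn.Module):
    """RoBERTa path: RobertaClassificationHead applied to sequence_output."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):
        x = sequence_output[:, 0]     # <s> token (equivalent to [CLS])
        x = self.dropout(x)           # <- the extra dropout BERT doesn't have
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)           # counterpart of BERT's single dropout
        return self.out_proj(x)


if __name__ == "__main__":
    seq = torch.randn(2, 8, 768)                  # (batch, seq_len, hidden)
    print(BertStylePooling(768, 2)(seq).shape)    # torch.Size([2, 2])
    print(RobertaStyleHead(768, 2)(seq).shape)    # torch.Size([2, 2])
```

Stripped down like this, the two heads are identical (dense, tanh, dropout, projection) except for the leading dropout in the RoBERTa version.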
Why not simply reuse the `pooled_output` like BERT does? What is the significance of this extra dropout layer?