A common practice for classification on top of an encoder-only model is to add a classification head to the encoder's pooled output. The pooled output usually passes through a dropout layer before the linear classification layer, with the dropout probability specified by `classifier_dropout`.
However, I see that across a few common decoder models (Mistral, Llama, Phi3), the sequence classification head (in the `*ForSequenceClassification` classes) has no dropout layer, even though dropout IS available for the token classification heads.
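To make the asymmetry concrete, here is a minimal sketch of the two head styles I'm describing. The class names and `classifier_dropout` default are illustrative, not the actual Transformers implementations; the decoder-style head mirrors the bare `score` linear layer those `*ForSequenceClassification` classes use:

```python
import torch
from torch import nn


class EncoderStyleClassificationHead(nn.Module):
    """Encoder-style head: pooled output -> dropout -> linear classifier."""

    def __init__(self, hidden_size: int, num_labels: int,
                 classifier_dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
        # (batch, hidden) -> (batch, num_labels)
        return self.classifier(self.dropout(pooled_output))


class DecoderStyleScoreHead(nn.Module):
    """Decoder-style sequence-classification head: a bare linear
    'score' layer with no dropout, as in the decoder models above."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, num_labels, bias=False)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Classify from the hidden state of the final token:
        # (batch, seq, hidden) -> (batch, num_labels)
        return self.score(last_hidden_state[:, -1, :])
```

Both heads produce `(batch, num_labels)` logits; the only structural difference is the dropout layer in the encoder-style head.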
Is there an underlying reason why dropout is omitted from the sequence classification heads but included in the token classification heads?