Trying to understand XForSequenceClassification heads

When answering this question, I found that BertForSequenceClassification doesn’t actually use pretrained linear weights for its classification layer. I had kind of expected that it would. Pretrained weights for that layer probably would not be useful for most tasks, and the model needs finetuning anyway, but still.
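Here is a rough way to see this without downloading anything. It is only a sketch: the tiny config values are made up, and a randomly initialized BertForPreTraining stands in for a real checkpoint, on the assumption that the published BERT checkpoints were saved from a pretraining model of that shape (encoder plus MLM/NSP heads).

```python
from transformers import BertConfig, BertForPreTraining, BertForSequenceClassification

# Hypothetical tiny config so the example runs quickly; not a real model size.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)

# Stand-in for a pretraining checkpoint: encoder under "bert.", heads under "cls.".
checkpoint = BertForPreTraining(config).state_dict()

# Load that checkpoint into the classification model, tolerating mismatched keys.
clf = BertForSequenceClassification(config)
result = clf.load_state_dict(checkpoint, strict=False)

# Keys the classification model needs but the checkpoint does not provide.
print(sorted(result.missing_keys))
# Keys in the checkpoint that the classification model has no use for.
print(sorted(result.unexpected_keys))
```

The missing keys are the classifier layer's weight and bias, so those keep their fresh random initialization, while the pretraining heads in the checkpoint are simply dropped. As far as I can tell, this is the same thing from_pretrained reports when it warns that some weights were newly initialized.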

Maybe the XForX models could have a line in their docs stating which of their heads are pretrained and which ones aren’t.