On using the final [CLS] hidden state of RoBERTa

The last hidden state of the [CLS] token is often used as a feature for training a downstream model. This confuses me, specifically for RoBERTa, because of its pre-training procedure.

I understand that simply having the same token at the start of every sentence enables it, through self-attention, to learn some representation of the sentence as a whole.
However, if I understand the training procedure correctly, the last [CLS] hidden state never enters any loss. The other tokens have a chance to be masked, pass through the final softmax layer, and be compared against the actual word, thereby learning a sensible representation. So do the earlier [CLS] hidden states, indirectly through the other tokens.
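To make the "never enters any loss" point concrete, here is a toy numpy sketch of a masked-LM loss. It follows the Hugging Face convention of marking unmasked positions (including [CLS] at position 0) with the label -100 so they are ignored; a numerical gradient check then shows the final-layer [CLS] logits receive exactly zero gradient. The shapes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, vocab = 5, 10
logits = rng.normal(size=(seq_len, vocab))  # final-layer logits; position 0 = [CLS]
# Label -100 means "not masked, ignore" (Hugging Face's convention);
# only positions 2 and 4 were masked in this toy example.
labels = np.array([-100, -100, 3, -100, 7])

def mlm_loss(logits, labels):
    """Cross-entropy averaged over masked positions only."""
    mask = labels != -100
    z = logits[mask]
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(mask.sum()), labels[mask]].mean()

# Numerical gradient of the loss w.r.t. the [CLS] logits (position 0):
eps = 1e-6
grad_cls = np.zeros(vocab)
for j in range(vocab):
    bumped = logits.copy()
    bumped[0, j] += eps
    grad_cls[j] = (mlm_loss(bumped, labels) - mlm_loss(logits, labels)) / eps

print(np.allclose(grad_cls, 0.0))  # True: the final [CLS] logits get no gradient
```

Note this only shows that the *output projection* at the [CLS] position gets no direct gradient; the transformer layers below still receive gradients through the masked positions, since their weights are shared across positions.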
(For BERT this is different as the final [CLS] hidden state is used in the next sentence prediction task, but this is excluded in RoBERTa.)

But the final [CLS] hidden state? I don’t see how this can give a sensible representation. Is it simply that, precisely because it never sees the loss, the final weights are never updated from their initialization, and, since the next-to-last [CLS] hidden state does get a chance to learn a sensible representation, multiplying it by a fixed random matrix still yields something sensible? But then surely it would be better to just use the next-to-last hidden state directly?
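For reference, grabbing the next-to-last [CLS] state is straightforward: with Hugging Face transformers, passing `output_hidden_states=True` returns a tuple with one entry per layer (embeddings plus each transformer layer), and you index the [CLS] position from the layer you want. The sketch below simulates that tuple with random arrays so it runs standalone; the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stack of per-layer activations, shaped like the tuple that
# `output_hidden_states=True` returns: (batch, seq_len, hidden) per layer.
num_layers, batch, seq_len, hidden = 13, 2, 6, 8
hidden_states = tuple(
    rng.normal(size=(batch, seq_len, hidden)) for _ in range(num_layers)
)

cls_last = hidden_states[-1][:, 0, :]          # final [CLS] hidden state
cls_next_to_last = hidden_states[-2][:, 0, :]  # next-to-last, as suggested above

print(cls_last.shape, cls_next_to_last.shape)
```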

Am I missing something?


That’s a very good observation about the paper.

My understanding is that the [CLS] token learns how to look through the full input when the model is fine-tuned on a downstream task since, as you said, it isn’t used during pre-training. Note that, in the paper, for RTE, STS and MRPC they first fine-tune on MNLI (which is very close to a next sentence prediction task) instead of starting directly from the pre-trained checkpoint (section 5.1). They also describe using the [CLS] token as the input to a classifier head for RACE (section 5.3).
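The classifier head in question is small: it selects the hidden state at position 0 and maps it to label logits. The numpy sketch below mirrors the shape of RoBERTa's sequence-classification head (dense + tanh on the [CLS] state, then a projection to the labels); the weights here are random stand-ins for what fine-tuning would learn.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n_labels = 8, 3

# Hypothetical weights; in a real fine-tuning run these are learned,
# and the gradient flows back into [CLS] from here.
W1, b1 = rng.normal(size=(hidden, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_labels)), np.zeros(n_labels)

def classification_head(last_hidden_state):
    """Select the [CLS] position, apply dense + tanh, project to label logits."""
    x = last_hidden_state[:, 0, :]  # (batch, hidden): the [CLS] state
    x = np.tanh(x @ W1 + b1)
    return x @ W2 + b2

batch, seq_len = 2, 5
last_hidden = rng.normal(size=(batch, seq_len, hidden))
label_logits = classification_head(last_hidden)
print(label_logits.shape)  # (2, 3)
```

Since the task loss is computed on these logits, fine-tuning is exactly where the final [CLS] state (and everything feeding it) finally gets a training signal.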