Often the final hidden state of the [CLS] token is used as a feature for training a downstream model. I am confused by this, specifically in the case of RoBERTa, because of its pre-training procedure.
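For concreteness, by "used as a feature" I mean something like the following minimal sketch using the Hugging Face transformers API (the `roberta-base` checkpoint is just my choice for illustration; in RoBERTa's vocabulary the [CLS] role is played by the `<s>` token at position 0):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

inputs = tokenizer("A sentence to embed.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Position 0 holds RoBERTa's <s> token (the [CLS] analogue);
# this vector is what gets fed to a downstream classifier.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```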
I understand that simply having the same token at the start of every sentence enables it, through self-attention, to learn some representation of the sentence as a whole.
However, if I understand the training procedure correctly, the final [CLS] hidden state never enters into any loss. Every other token has a chance of being masked, in which case its final hidden state goes through the softmax layer of the masked-language-modelling head and is compared with the actual word, so it learns a sensible representation. The earlier [CLS] hidden states learn too, indirectly, because the other tokens attend to them.
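To spell out why I believe the final [CLS] state escapes the loss, here is a sketch of the masking step as I understand it (modelled loosely on what transformers' `DataCollatorForLanguageModeling` does; I omit the 80/10/10 mask/replace/keep split for brevity):

```python
import torch

def mask_for_mlm(input_ids, special_tokens_mask, mask_token_id, mlm_prob=0.15):
    """Pick MLM targets; special tokens (<s>/[CLS], </s>/[SEP]) are never candidates."""
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, mlm_prob)
    prob.masked_fill_(special_tokens_mask.bool(), 0.0)  # [CLS] can never be a target
    is_target = torch.bernoulli(prob).bool()
    labels[~is_target] = -100  # cross-entropy ignores these positions, so the loss
                               # never touches the final [CLS] hidden state directly
    corrupted = input_ids.clone()
    corrupted[is_target] = mask_token_id  # (80/10/10 split omitted for brevity)
    return corrupted, labels
```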
(BERT is different here: its final [CLS] hidden state feeds the next-sentence-prediction head, but RoBERTa drops that objective.)
But the final [CLS] hidden state? I don't see how it can give a sensible representation. Is it simply that, precisely because it never sees the loss, the final layer's weights are never updated from their initialization, and since the next-to-last [CLS] hidden state does get the chance to learn a sensible representation, multiplying it by a fixed random matrix still yields something sensible? But then surely it would be better to just use the next-to-last hidden state directly?
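That is, something like the following instead (again a sketch against `roberta-base`; `hidden_states` contains the embedding output plus one entry per layer, so `[-1]` is the final layer and `[-2]` the one before it):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

inputs = tokenizer("A sentence to embed.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states = (embeddings, layer 1, ..., layer 12);
# [-1] equals last_hidden_state, [-2] is the next-to-last layer.
penultimate_cls = outputs.hidden_states[-2][:, 0, :]  # shape: (1, 768)
```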
Am I missing something?