Often the final hidden state of the [CLS] token is used as a feature for training a downstream model. I am confused by this, specifically in the case of RoBERTa, because of its pre-training procedure.
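For concreteness, by "used as a feature" I mean something like the following minimal sketch using the Hugging Face transformers API (the `roberta-base` checkpoint is just my choice for illustration; in RoBERTa's vocabulary the [CLS] role is played by the `<s>` token at position 0):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

inputs = tokenizer("A sentence to embed.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Position 0 holds RoBERTa's <s> token (the [CLS] analogue);
# this vector is what gets fed to a downstream classifier.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```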
I understand that simply having the same token at the start of every sentence enables it, through self-attention, to learn some representation of the sentence as a whole.
However, if I understand the training procedure correctly, the final [CLS] hidden state never enters into any loss. Every other token has a chance of being masked, in which case its final hidden state goes through the softmax layer of the masked-language-modelling head and is compared with the actual word, so it learns a sensible representation. The earlier [CLS] hidden states learn too, indirectly, because the other tokens attend to them.
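To spell out why I believe the final [CLS] state escapes the loss, here is a sketch of the masking step as I understand it (modelled loosely on what transformers' `DataCollatorForLanguageModeling` does; I omit the 80/10/10 mask/replace/keep split for brevity):

```python
import torch

def mask_for_mlm(input_ids, special_tokens_mask, mask_token_id, mlm_prob=0.15):
    """Pick MLM targets; special tokens (<s>/[CLS], </s>/[SEP]) are never candidates."""
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, mlm_prob)
    prob.masked_fill_(special_tokens_mask.bool(), 0.0)  # [CLS] can never be a target
    is_target = torch.bernoulli(prob).bool()
    labels[~is_target] = -100  # cross-entropy ignores these positions, so the loss
                               # never touches the final [CLS] hidden state directly
    corrupted = input_ids.clone()
    corrupted[is_target] = mask_token_id  # (80/10/10 split omitted for brevity)
    return corrupted, labels
```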
(BERT is different here: its final [CLS] hidden state feeds the next-sentence-prediction head, but RoBERTa drops that objective.)
But the final [CLS] hidden state? I don't see how it can give a sensible representation. Is it simply that, precisely because it never sees the loss, the final layer's weights are never updated from their initialization, and since the next-to-last [CLS] hidden state does get the chance to learn a sensible representation, multiplying it by a fixed random matrix still yields something sensible? But then surely it would be better to just use the next-to-last hidden state directly?
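That is, something like the following instead (again a sketch against `roberta-base`; `hidden_states` contains the embedding output plus one entry per layer, so `[-1]` is the final layer and `[-2]` the one before it):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

inputs = tokenizer("A sentence to embed.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states = (embeddings, layer 1, ..., layer 12);
# [-1] equals last_hidden_state, [-2] is the next-to-last layer.
penultimate_cls = outputs.hidden_states[-2][:, 0, :]  # shape: (1, 768)
```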
Am I missing something?