Does it make sense to use the CLS token with RoBERTa-based models?


I know that some transformer models, such as RoBERTa-based models, are not pretrained with the Next Sentence Prediction objective. In that case, the CLS token does not mean anything, right?

Given that the CLS token is not pretrained, when I develop my downstream classification task, would it be better to fine-tune the CLS token or to apply average pooling over all the tokens?

Thanks in advance,

Hi, the importance of the [CLS] token is not limited to the NSP (Next Sentence Prediction) task. As far as I understand how it works, you can use it for fine-tuning on other tasks too, because [CLS] is the special token that attends to all the other tokens in the sequence, so its representation summarizes the context of the whole sequence.
In the NSP task, it learns this representation through self-attention over all the tokens in the context (from both sequences of the input pair).
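
To make the two options from the question concrete, here is a minimal sketch of CLS pooling versus masked average pooling. It uses toy NumPy arrays rather than a real model: `hidden_states` is a stand-in I made up for an encoder's last hidden states with shape (batch, seq_len, hidden), and `attention_mask` marks real tokens with 1 and padding with 0. The function names are hypothetical too. In RoBERTa the first token is `<s>` rather than `[CLS]`, but it sits at position 0 in the same way.

```python
import numpy as np

def cls_pooling(hidden_states):
    """Take the vector at position 0 ([CLS], or <s> in RoBERTa) of each sequence."""
    return hidden_states[:, 0, :]

def mean_pooling(hidden_states, attention_mask):
    """Average the token vectors of each sequence, ignoring padding positions."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for a fully padded row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy stand-in for encoder output: batch of 2, 4 tokens, hidden size 8.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 4, 8))
# The second sequence has two padding tokens at the end.
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

cls_vec = cls_pooling(hidden_states)
mean_vec = mean_pooling(hidden_states, attention_mask)
print(cls_vec.shape, mean_vec.shape)
```

Either vector can then be fed to a classification head. Which one works better is task dependent as far as I know, so it is probably worth trying both on a validation set.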

My doubt is how the CLS token learns the context of the sequence if it is not used during pretraining (as in RoBERTa-based models).

If I understood correctly, in that case the CLS token learns only during fine-tuning, which is still good enough for text classification.