Significance of the [CLS] token

Yes.

Using it as the first token is not “special” or new. Earlier NLP approaches also often had a beginning-of-sentence (BOS) token or something similar. You wouldn’t want the token sitting in between other tokens either. The reason is not so much [CLS] itself but the other tokens: the relative positioning of tokens matters, because a token’s position in the sequence changes its representation through the positional encoding. Linguistically, you therefore want the sequence order as-is, without any extra information floating in between.
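To make that concrete, here is a minimal sketch (assuming the Hugging Face transformers library and bert-base-uncased) showing that the tokenizer always prepends [CLS], so it sits at position 0 and every other token keeps its natural order behind it:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat sat on the mat.")

# [CLS] is always the first token, [SEP] the last
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']
```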

AFAIK special tokens cannot be masked during pretraining, so [CLS] is always at the front and its importance is learnt through attention, just like for the other tokens, but without it ever having to be “predicted”. In the second pretraining task, next sentence prediction, its final hidden state serves as the input to the classifier.
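As a rough sketch of that setup (assuming transformers and PyTorch; the linear head below is only illustrative, not the actual NSP head), the final hidden state of [CLS] would be pulled out and passed to a classifier like this:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sentence pair, as in next sentence prediction
inputs = tokenizer("He went to the store.", "He bought a gallon of milk.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_hidden = outputs.last_hidden_state[:, 0, :]            # final [CLS] state, shape (1, 768)
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # illustrative head: is-next / not-next
logits = classifier(cls_hidden)
```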

Without fine-tuning? No. Also note that BERT only has an encoder; to “generate” the original tokens, you’d typically need a decoder. You could try something like an auto-encoder, or set up an encoder-decoder similar to single-representation MT, but the chances are small that you can reproduce exactly the same input sentence.

Table 7 of their paper is perhaps also interesting to read: they not only tried using the final state of [CLS] in downstream tasks, but also different feature extractions across the layers of the model.
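A sketch of that feature-based style of extraction (again assuming transformers; concatenating the last four layers mirrors one of the variants compared in Table 7):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple: embedding output + one tensor per layer

# Concatenate the last four encoder layers per token
last_four = torch.cat(hidden_states[-4:], dim=-1)  # shape (1, seq_len, 4 * 768)
token_features = last_four[0, 1:-1]                # drop [CLS] and [SEP] for token-level tasks
```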
