Significance of the [CLS] token

Yes.

Using it as the first token is not “special” or new. Earlier NLP approaches also often had a beginning-of-sentence (BOS) token or something similar. You wouldn’t want the token sitting in between other tokens either. The reason is not so much [CLS] itself but the other tokens: the relative positioning of tokens matters, because a token’s position in the sequence changes its representation through the positional encoding. Linguistically, you therefore want the sequence order as-is, without any extra information floating in between.
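To make that concrete, here is a minimal sketch (assuming the Hugging Face transformers library and bert-base-uncased) showing that the tokenizer always prepends [CLS], so it sits at position 0 and every other token keeps its natural order behind it:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat sat on the mat.")

# [CLS] is always the first token, [SEP] the last
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']
```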

AFAIK special tokens cannot be masked during pretraining, so [CLS] is always at the front and its importance is learnt through attention, just like for the other tokens, but without it ever having to be “predicted”. In the second pretraining task, next sentence prediction, its final hidden state serves as the input to the classifier.
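As a rough sketch of that setup (assuming transformers and PyTorch; the linear head below is only illustrative, not the actual NSP head), the final hidden state of [CLS] would be pulled out and passed to a classifier like this:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sentence pair, as in next sentence prediction
inputs = tokenizer("He went to the store.", "He bought a gallon of milk.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_hidden = outputs.last_hidden_state[:, 0, :]            # final [CLS] state, shape (1, 768)
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # illustrative head: is-next / not-next
logits = classifier(cls_hidden)
```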

Without fine-tuning? No. Also note that BERT only has an encoder; to “generate” the original tokens, you’d typically need a decoder. You could try something like an auto-encoder, or set up an encoder-decoder similar to single-representation MT, but the chances are small that you can reproduce exactly the same input sentence.

Table 7 of their paper is perhaps also interesting to read: they not only tried using the final state of [CLS] in downstream tasks, but also different feature extractions across the layers of the model.
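A sketch of that feature-based style of extraction (again assuming transformers; concatenating the last four layers mirrors one of the variants compared in Table 7):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple: embedding output + one tensor per layer

# Concatenate the last four encoder layers per token
last_four = torch.cat(hidden_states[-4:], dim=-1)  # shape (1, seq_len, 4 * 768)
token_features = last_four[0, 1:-1]                # drop [CLS] and [SEP] for token-level tasks
```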
