Significance of the [CLS] token


I’ve been using the HuggingFace library for quite some time now. I follow the tutorials, swap the tutorial data with my project data, and get very good results. I wanted to dig a little deeper into how classification happens with BERT and BERT-based models, and I’m stuck on one key feature: the [CLS] token, which is responsible for the actual classification. I hope people here can answer my questions, because I haven’t been able to find answers on my own.

When I searched for what the [CLS] token actually represents, most of the results say that “it is an aggregate representation of the sequence”. I can understand this part. Before BERT, people used various techniques to represent documents, ranging from averaging the word vectors of the document to computing document vectors with doc2vec. I can also understand that by stacking a linear classifier on top and feeding it the [CLS] token’s vector (768-dimensional for a bert-base-uncased model), we can classify the sequence.
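To make that concrete, here is a minimal sketch of what “stacking a linear classifier on the [CLS] token” looks like. The hidden states below are random tensors standing in for what bert-base-uncased would actually output, so no model download is needed; the shapes and the position of [CLS] are the point.

```python
import torch

# Stand-in for BERT's output: (batch, seq_len, hidden_size).
# A real run would use the last_hidden_state from bert-base-uncased.
batch, seq_len, hidden = 2, 16, 768
hidden_states = torch.randn(batch, seq_len, hidden)

# [CLS] is always the first token, so its vector sits at position 0.
cls_vector = hidden_states[:, 0, :]        # shape: (batch, 768)

# A linear classification head on top (here, 2 classes).
classifier = torch.nn.Linear(hidden, 2)
logits = classifier(cls_vector)            # shape: (batch, 2)
print(logits.shape)
```

During fine-tuning, gradients from this head flow back through the [CLS] position into the whole encoder, which is how that single vector comes to summarize the sequence.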

Here are my questions:

  1. Is my above understanding of the [CLS] token correct?
  2. Why is it always the first token? Why not the second, third, or last? Did the authors of the original BERT paper settle on the first token by trial and error?
  3. How exactly does it “learn” the representation of the sequence? It’s basically trained in the same way as the other input tokens in the sequence, so what makes it special enough to represent the entire sequence? I couldn’t find an explanation in the paper or in my search afterwards.
  4. Is it at all possible to recover the original sequence from the [CLS] token (I think not, but worth asking)?

I hope I can find some answers to these questions (or at least pointers to resources where I can find them). Please let me know if this is not the correct place to post these questions and where I should post them instead.

Thank you.


I would love to hear from others!

Hi, @shaun

I believe the first token was selected arbitrarily, for convenience.
In practice, you can fine-tune a classification task using any token, or an “average of tokens” (e.g. Keras’s GlobalAveragePooling1D).
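To illustrate the “any token or average of tokens” idea, here is a hedged sketch of both pooling strategies side by side, again on random stand-in hidden states rather than real BERT outputs:

```python
import torch

# Stand-in for encoder output: (batch, seq_len, hidden_size).
hidden_states = torch.randn(2, 16, 768)
attention_mask = torch.ones(2, 16)         # 1 = real token, 0 = padding

# Strategy 1: take the [CLS] vector at position 0.
cls_pooled = hidden_states[:, 0, :]        # (batch, 768)

# Strategy 2: masked mean over all token vectors, which is what
# average pooling (GlobalAveragePooling1D) does, here with padding
# excluded from the average.
mask = attention_mask.unsqueeze(-1)        # (batch, seq_len, 1)
mean_pooled = (hidden_states * mask).sum(1) / mask.sum(1)  # (batch, 768)

# Either 768-dim vector can feed the same linear classification head.
print(cls_pooled.shape, mean_pooled.shape)
```

Both strategies produce a vector of the same size, so the downstream classifier is identical; only where the summary comes from differs.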

Thanks for the reply. But aren’t the other tokens each tied to a particular input token, as opposed to the [CLS] token, which doesn’t correspond to any input token? If that’s the case, how does it make sense to fine-tune on an arbitrary token for our classification?

This is what is tripping me up. Is there really no reasoning, empirical or otherwise, behind creating a token called [CLS] to be used as input for downstream classification tasks?

I may be wrong when I said any token would do. If you have time, maybe you can run an experiment on that.

My intuition is that, at first, each other token may indeed represent its original input token. But if you fine-tune any of them (through backpropagation), it can perform as well as [CLS]. (I’ve never actually tried it.)

One thing from my experience in Kaggle NLP competitions, however, is that average pooling (GlobalAveragePooling1D) is not inferior to using [CLS].