I’ve been using the HuggingFace library for quite some time now. I follow the tutorials, swap the tutorial data for my project data, and get very good results. I wanted to dig a little deeper into how classification happens with BERT and BERT-based models, but I’m not able to understand a key feature: the [CLS] token, which is responsible for the actual classification. I hope the smart people here can answer my questions, because I haven’t been able to find the answers on my own.
When I searched for what the [CLS] token actually represents, most of the results said something like “it is an aggregate representation of the sequence”. I can understand this part. Before BERT, people used various techniques to represent documents, ranging from averaging the word vectors of a document to computing document vectors with doc2vec. I can also understand that by stacking a linear classifier on top and feeding it the values of the [CLS] token (768 dimensions for a bert-base-uncased model), we can end up classifying the sequence.
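Just to make sure we’re talking about the same thing, here is a minimal sketch of what I understand that classification head to be doing. I’m using random tensors in place of BERT’s actual output, and the shapes assume bert-base-uncased (hidden size 768); the variable names are my own:

```python
import torch
import torch.nn as nn

# Shapes assume bert-base-uncased: hidden size 768.
batch_size, seq_len, hidden = 2, 16, 768

# Stand-in for BERT's last_hidden_state: one 768-dim vector per input token.
last_hidden_state = torch.randn(batch_size, seq_len, hidden)

# [CLS] is always the token at position 0, so its vector is the first slice.
cls_vector = last_hidden_state[:, 0, :]   # shape: (batch_size, 768)

# A single linear layer turns that one vector into class logits.
num_labels = 2
classifier = nn.Linear(hidden, num_labels)
logits = classifier(cls_vector)           # shape: (batch_size, num_labels)
```

So the classifier never sees the other tokens’ vectors at all, only the single [CLS] vector, which is what makes my questions below matter.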
Here are my questions:
- Is my above understanding of the [CLS] token correct?
- Why is it always the first token? Why not the second, the third, or the last? Did the authors of the original BERT paper settle on the first position by trial and error?
- How exactly does it “learn” a representation of the sequence? It’s basically trained in the same way as the other input tokens, so what makes it special enough to represent the entire sequence? I couldn’t find an explanation for this in the paper or in my searching afterwards.
- Is it at all possible to recover the original sequence from the [CLS] token (I think not, but it’s worth asking)?
I hope I can find some answers to these questions (or at least pointers to resources where I can find them). Please let me know if this is not the correct place to post these questions, and where I should post them instead.