In most sentiment analysis tasks implemented with BERT, I see that only the embedding of [CLS] is passed to the classifier, while the other token embeddings go unused. What is the reason behind this?
According to the paper, BERT's [CLS] token aggregates the hidden states of the other tokens, which renders them "useless" for sequence classification tasks, since all the relevant information is already pooled into [CLS].
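As a minimal sketch of what this looks like in practice (using the Hugging Face `transformers` library; the `bert-base-uncased` checkpoint and the two-label linear head are just illustrative choices), the classifier only ever sees the hidden state at position 0, which is where the [CLS] token sits:

```python
import torch
from transformers import BertModel, BertTokenizer

# Illustrative checkpoint; any BERT variant exposes the same outputs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical classification head: 2 labels for binary sentiment.
classifier = torch.nn.Linear(model.config.hidden_size, 2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden_size);
# index 0 along the sequence dimension is the [CLS] token.
cls_embedding = outputs.last_hidden_state[:, 0, :]

logits = classifier(cls_embedding)  # shape (batch, 2)
print(logits.softmax(dim=-1))
```

For reference, `BertModel` also returns `outputs.pooler_output`, which is this same [CLS] hidden state passed through an extra dense layer and a tanh, and that pooled vector is what `BertForSequenceClassification` feeds to its classification head.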
Could you please elaborate on that? I see that the transformers/BERT layers include Token Embeddings, Segment Embeddings, and Position Embeddings, along with the [CLS] and [SEP] tokens.