In the upcoming book “Natural Language Processing with Transformers”, we’re taught how to do a classification task on sentences by using Transformers as feature extractors: we pass the sentences through a transformer to get the hidden states.
To train a classifier, we take the token embedding for just the first token, namely the “[CLS]” token, and ignore the rest of the sequence. The book says it’s common practice to do that.
It doesn’t make much sense to me to ignore the rest of the embeddings. Shouldn’t they be averaged or something?
The only reasoning I can think of is that the attention layers of the encoder make the CLS token absorb the meaningful context?
Thank you! The book is awesome by the way, highly recommended!
Hi @carlosaguayo, thanks for your question and I’m glad you’re enjoying the book!
In general, we need a way to represent the sequence of embeddings as a single vector, and there are several “pooling” techniques that people use in the literature:
- [CLS] pooling: just take the embedding of the [CLS] token as the representation for the whole sequence
- mean pooling: take the average of token embeddings
- max pooling: take the element-wise maximum over the token embeddings
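To make the three options concrete, here’s a small sketch of each pooling strategy applied to a dummy hidden-state tensor (the random tensor stands in for something like `model(**inputs).last_hidden_state`; the shapes and mask handling are mine, not from the book):

```python
import torch

# Toy stand-in for a transformer's last hidden state:
# batch of 2 sequences, 4 tokens each, hidden size 8.
hidden_states = torch.randn(2, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 0],   # first sequence has 1 padding token
                               [1, 1, 1, 1]])

# [CLS] pooling: just the embedding of the first token
cls_pooled = hidden_states[:, 0]                       # shape (2, 8)

# Mean pooling: average over the real (non-padding) tokens only
mask = attention_mask.unsqueeze(-1).float()            # (2, 4, 1)
mean_pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)

# Max pooling: element-wise max over tokens (mask padding to -inf first)
masked = hidden_states.masked_fill(mask == 0, float("-inf"))
max_pooled = masked.max(dim=1).values                  # shape (2, 8)

print(cls_pooled.shape, mean_pooled.shape, max_pooled.shape)
```

Each variant gives you one fixed-size vector per sequence, which is what the downstream classifier needs.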
A related question is whether pooling should be applied to the last hidden states, or some earlier layers (or concatenation thereof).
Now, which pooling method + layer(s) provides the best feature representation tends to depend on the task at hand, the domain of the data, the length of the texts, and so on. We picked [CLS] pooling in this early chapter because it’s simple and tends to be “good enough” for text classification tasks. You can find a nice ablation study that examined some of these issues here.
As to why this even works, your insight that it’s due to self-attention is spot on! Each token embedding in the sequence is contextualised through the attention mechanism, so the [CLS] token does contain information about subsequent tokens in the sequence (we explain this in more detail in Chapter 3).
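You can see the mechanism in a bare-bones sketch of scaled dot-product self-attention (single head, no learned projections — a simplification, not any specific model’s code). The output for the [CLS] position is a weighted sum over *every* token’s value vector, so it mixes in information from the whole sequence:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# 4-token sequence with hidden size 8; row 0 plays the role of [CLS]
x = torch.randn(4, 8)
q, k, v = x, x, x                          # no learned projections in this sketch

scores = q @ k.T / (8 ** 0.5)              # (4, 4) scaled dot-product scores
weights = F.softmax(scores, dim=-1)        # each row sums to 1

# The [CLS] output is a weighted mix of ALL token values,
# which is why it can carry sequence-level information.
cls_out = weights[0] @ v                   # shape (8,)
print(weights[0])                          # [CLS] puts nonzero weight on every token
```

Because the softmax weights are strictly positive, the [CLS] output always attends to every token at least a little; training then shapes those weights so the useful context gets absorbed.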
Hope that helps!
It does help! Thank you for the explanation and the link!