Last hidden state vs pooler output in CLIPVisionModel

shantanuacharya · November 17, 2022, 11:55pm

Hi, I have an image tensor of shape (1, 3, 224, 224) and I want to get its embedding from the CLIPVisionModel with the openai/clip-vit-base-patch32 weights.

Now when I pass the tensor to the vision model, I get two outputs with the following shapes:

last_hidden_state: [1, 50, 768]
pooler_output: [1, 768]

In one discussion thread, I read that last_hidden_state[:, 0] gives the overall representation of the image, while in another thread I read that pooler_output is the better choice. I’m a bit confused about this. Can anyone please help me with this?

nielsr · November 18, 2022, 8:12am

Hi,

Yes, there are typically 2 ways to get a “pooled” representation of an entire image. One is to take the last_hidden_state and average it across the sequence dimension. So you could do last_hidden_state.mean(dim=1) and use this as your image representation.
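As a minimal sketch of the averaging option, the snippet below uses a random tensor as a stand-in for the model's last_hidden_state (the shape [1, 50, 768] is taken from the question; the variable names are just for illustration). It also shows a variant that skips the CLS token at index 0, which is an assumption about what one might want rather than something the model does for you.

```python
import torch

# Stand-in for the model output: batch of 1, 50 tokens (1 CLS + 49 patches), 768 dims.
last_hidden_state = torch.randn(1, 50, 768)

# Average over the sequence dimension (CLS token included).
mean_all = last_hidden_state.mean(dim=1)

# Variant: average over the patch tokens only, skipping the CLS token at index 0.
mean_patches = last_hidden_state[:, 1:, :].mean(dim=1)

print(mean_all.shape, mean_patches.shape)
```

Both results have shape [1, 768], so either can be used wherever a fixed-size image embedding is expected.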

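An alternative representation is indeed the pooler_output, which takes the embedding of the special CLS token (the first position of last_hidden_state) and applies a layernorm to it, as seen in the modeling code. So pooler_output differs from last_hidden_state[:, 0] only by that final layernorm.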
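The relationship between pooler_output and the CLS token can be checked with a short sketch. To avoid downloading weights, this builds a randomly initialised CLIPVisionModel from a CLIPVisionConfig whose sizes are set to match the shapes in the question; with real weights you would load the model with from_pretrained("openai/clip-vit-base-patch32") instead. The attribute path model.vision_model.post_layernorm reflects the transformers implementation as I understand it and may differ between library versions.

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModel

# Random weights, no download. Sizes chosen to reproduce the shapes from the
# question (assumption: these mirror the openai/clip-vit-base-patch32 geometry).
config = CLIPVisionConfig(hidden_size=768, image_size=224, patch_size=32)
model = CLIPVisionModel(config).eval()

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
    # Take the CLS token (first position) and apply the final layernorm by hand.
    cls_token = outputs.last_hidden_state[:, 0]
    recomputed = model.vision_model.post_layernorm(cls_token)

print(outputs.last_hidden_state.shape, outputs.pooler_output.shape)
print(torch.allclose(recomputed, outputs.pooler_output, atol=1e-6))
```

If the check passes, choosing between last_hidden_state[:, 0] and pooler_output comes down to whether you want the final layernorm applied.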

I’d say that average pooling all last hidden states and taking the CLS token’s representation give similar results. However, the original ViT authors later released a follow-up paper in which they replaced the CLS token with average pooling and got better results.
