# Last hidden state vs pooler output in CLIPVisionModel

Hi, I have an image tensor of shape (1, 3, 224, 224) and I want to get its embedding from the `CLIPVisionModel` with the `openai/clip-vit-base-patch32` weights.

Now when I pass the tensor to the vision model, I get two outputs with the following shapes:

• `last_hidden_state`: [1, 50, 768]
• `pooler_output`: [1, 768]
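
For reference, here is a minimal snippet that reproduces these shapes (with a random tensor standing in for a real preprocessed image):

```python
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Random tensor standing in for a preprocessed image
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 50, 768]) -> 1 CLS token + 49 patch tokens
print(outputs.pooler_output.shape)      # torch.Size([1, 768])
```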

In one discussion thread I read that `last_hidden_state[:, 0]` (the CLS token) represents the overall image, while in another thread I read that `pooler_output` is the better choice. I’m a bit confused about this; can anyone please help me?

Hi,

Yes, there are typically two ways to get a “pooled” representation of an entire image. One is to take the `last_hidden_state` and average it across the sequence dimension, i.e. `last_hidden_state.mean(dim=1)`, and use that as your image representation.
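
For example, reusing the `outputs` from the snippet in your question:

```python
# Average over all 50 tokens (1 CLS token + 49 patch tokens)
image_embedding = outputs.last_hidden_state.mean(dim=1)
print(image_embedding.shape)  # torch.Size([1, 768])
```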

The alternative is indeed the `pooler_output`, which takes the embedding of the first (special CLS) token from the `last_hidden_state` and applies a layernorm to it.
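
As a sketch of what that means in code (assuming the `outputs` from the snippet above; in recent transformers versions the layernorm is exposed as `model.vision_model.post_layernorm`, though the attribute path may differ across releases):

```python
# pooler_output should equal the layernorm applied to the CLS token's final hidden state
cls_embedding = outputs.last_hidden_state[:, 0]
pooled = model.vision_model.post_layernorm(cls_embedding)
print(torch.allclose(pooled, outputs.pooler_output))  # True
```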

I’d say that average-pooling the last hidden states and taking the CLS token’s representation give similar results. However, the original ViT authors later released a paper in which they replaced the CLS token with average pooling and got better results.
