CLIPModel's get_text_features vs. CLIPTextModel's pooled outputs

Hi all,

I am trying to obtain embedding vectors for text using CLIP that will eventually be used for cross-attention with a latent diffusion model (the standard text-to-image approach). For context, the embedding vectors will first pass through some of my own linear projection layers during training before being fed into the UNet.

A common practice for obtaining CLIP text embeddings is to use CLIPModel's get_text_features method. From the code, I can see that it projects the pooled output of the text encoder through a linear self.text_projection layer. I have verified this by comparing against the .pooler_output from CLIPTextModel.
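
For reference, here is a minimal sketch of that comparison (assuming the openai/clip-vit-large-patch14 checkpoint and the Hugging Face transformers API); get_text_features should just be text_projection applied to the text model's pooled output:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")

with torch.no_grad():
    # Projected features returned by get_text_features
    text_features = model.get_text_features(**inputs)

    # Pooled output of the underlying text encoder
    # (same tensor as CLIPTextModel's .pooler_output)
    pooled = model.text_model(**inputs).pooler_output

    # Manually applying the projection should reproduce get_text_features
    projected = model.text_projection(pooled)

print(torch.allclose(text_features, projected, atol=1e-5))  # expected: True
```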

My question is: what is this self.text_projection for, specifically? It projects from 768 to 768, so the dimensionality doesn't change. Does the text_projection matter at all in my case, or can I just use the pooled outputs directly from CLIPTextModel? I also see people using the pooled outputs as CLIP's text embeddings, so it's confusing to know when to do which.

Apologies if this question is a bit nuanced.

I think the projection layer is just there to project the text features (coming from the text encoder) and the image features (coming from the vision encoder) into the same embedding space. The vision encoder uses a different hidden size (e.g. 1024 for openai/clip-vit-large-patch14), so the projection layers on the text and vision sides map both towers to the same shared dimensionality (512 or 768, depending on the checkpoint).
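
As a quick sanity check (again assuming the openai/clip-vit-large-patch14 checkpoint), you can inspect the config and the two projection layers; the text and vision encoders have different widths, and both projections map into the same shared projection_dim:

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

print(model.config.text_config.hidden_size)    # 768  (text encoder width)
print(model.config.vision_config.hidden_size)  # 1024 (vision encoder width)
print(model.config.projection_dim)             # 768  (shared embedding space)

# Both projections map their encoder's width into the same projection_dim
print(model.text_projection)    # Linear(in_features=768, out_features=768, bias=False)
print(model.visual_projection)  # Linear(in_features=1024, out_features=768, bias=False)
```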
