CLIPModel's get_text_features vs. CLIPTextModel's pooled outputs

Hi all,

I am trying to obtain embedding vectors for text using CLIP that will eventually be used for cross-attention with a latent diffusion model (the standard text-to-image approach). For context, the embedding vectors will first pass through some of my own linear projection layers during training before being fed into the UNet.

A common practice for obtaining CLIP text embeddings is to use CLIPModel's get_text_features method. From the code, I can see that it projects the pooled output of the text encoder through a linear self.text_projection layer. I have verified this by comparing against the .pooler_output from CLIPTextModel.
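
For reference, here is a minimal sketch of that comparison (assuming the openai/clip-vit-large-patch14 checkpoint and the Hugging Face transformers API); get_text_features should just be text_projection applied to the text model's pooled output:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")

with torch.no_grad():
    # Projected features returned by get_text_features
    text_features = model.get_text_features(**inputs)

    # Pooled output of the underlying text encoder
    # (same tensor as CLIPTextModel's .pooler_output)
    pooled = model.text_model(**inputs).pooler_output

    # Manually applying the projection should reproduce get_text_features
    projected = model.text_projection(pooled)

print(torch.allclose(text_features, projected, atol=1e-5))  # expected: True
```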

My question is: what is this self.text_projection for, specifically? It projects from 768 to 768, so the dimensionality doesn't change. Does the text_projection matter at all in my case, or can I just use the pooled outputs directly from CLIPTextModel? I also see people using the pooled outputs as CLIP's text embeddings, so it's confusing to know when to do which.

Apologies if this question is a bit nuanced.

I think the projection layer is just there to project the text features (coming from the text encoder) and the image features (coming from the vision encoder) into the same embedding space. The vision encoder uses a different hidden size (e.g. 1024 for openai/clip-vit-large-patch14), so the projection layers on the text and vision sides map both towers to the same shared dimensionality (512 or 768, depending on the checkpoint).
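
As a quick sanity check (again assuming the openai/clip-vit-large-patch14 checkpoint), you can inspect the config and the two projection layers; the text and vision encoders have different widths, and both projections map into the same shared projection_dim:

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

print(model.config.text_config.hidden_size)    # 768  (text encoder width)
print(model.config.vision_config.hidden_size)  # 1024 (vision encoder width)
print(model.config.projection_dim)             # 768  (shared embedding space)

# Both projections map their encoder's width into the same projection_dim
print(model.text_projection)    # Linear(in_features=768, out_features=768, bias=False)
print(model.visual_projection)  # Linear(in_features=1024, out_features=768, bias=False)
```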
