Hi all,
I am trying to obtain embedding vectors for text using CLIP that will eventually be used for cross-attention with a latent diffusion model (the standard text-to-image approach). For context, the embedding vectors will first pass through some of my own linear projection layers during training before being fed into the UNet.
A common practice for obtaining CLIP text embeddings is to use `CLIPModel`'s `get_text_features` method. From the code, I can see that it projects the pooled output through a linear `self.text_projection` layer. I have also verified this by comparing against the `.pooler_output` from `CLIPTextModel`.
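For reference, here is roughly how I did that comparison. This is just a minimal sketch; the `openai/clip-vit-large-patch14` checkpoint name is only an example, not necessarily the one you'd use:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Example checkpoint (assumption): the ViT-L/14 text encoder used by Stable Diffusion
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")

with torch.no_grad():
    # Path 1: the convenience method (pooled output -> text_projection)
    text_features = model.get_text_features(**inputs)

    # Path 2: take the text encoder's pooler_output and project it manually
    pooled = model.text_model(**inputs).pooler_output
    projected = model.text_projection(pooled)

# The two paths match, confirming get_text_features = text_projection(pooler_output)
print(torch.allclose(text_features, projected))  # True
```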
My question is: what is this `self.text_projection` for, specifically? It's projecting from 768 to 768, so the dimensionality doesn't change. Does `text_projection` matter at all in my case, or can I just use the pooled output directly from `CLIPTextModel`? I also see people using the pooled output to obtain CLIP's text embeddings, so it's confusing to know when to do what here.
Apologies if this question is a bit nuanced.