Stable Diffusion CLIP similarity

The StableDiffusionPipeline uses a CLIPTextModel for the text embeddings. Can I pair it with the corresponding CLIPVisionModel to get CLIP text-image similarity for generated images? Which pretrained CLIPVisionModel should I use?


This is exactly what I am trying to do.

But I am blocked by the fact that the CLIPVisionModel embedding output has a different size than I expected.

In this minimal working example you can see that even though CLIPVisionModel.config.projection_dim == CLIPTextModel.config.projection_dim == 768, the embeddings still have different shapes:

  (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
  (position_embedding): Embedding(257, 1024)
  (token_embedding): Embedding(49408, 768)
  (position_embedding): Embedding(77, 768)

I am confused about this and I can’t find much more in the documentation.


You can compute the similarity score between text and image by taking the dot product of the corresponding (normalized) embeddings. To do so, you should use the CLIP vision model that matches the text encoder used for the text embeddings. See for example this snippet here: transformers/ at v4.24.0 · huggingface/transformers · GitHub. If you invoke that function with multiple images and texts, logits_per_image gives you the scores of each image across the text prompts, and logits_per_text gives you how well each text matches each image. You can also pass a single image and a single prompt to get a score that is comparable across other image-text pairs.
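As a minimal sketch of that approach (assuming the openai/clip-vit-large-patch14 checkpoint, whose text encoder Stable Diffusion v1 uses, and a stand-in image in place of your generated one):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: openai/clip-vit-large-patch14 is the checkpoint whose
# text encoder Stable Diffusion v1 uses, so its vision tower is the
# matching one for scoring generated images.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Stand-in for a generated image; replace with your own PIL image.
image = Image.new("RGB", (512, 512), "red")
prompts = ["a red square", "a photo of a cat"]

inputs = processor(text=prompts, images=[image], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One row per image, one column per prompt; higher means a better match.
print(outputs.logits_per_image)        # shape (1, 2)
```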

There’s a usage example here: CLIP, which could be a good starting point 🙂


This is what I was looking for:

with torch.no_grad():
    # pooled outputs (index 1) from each tower
    vision_outputs = model.vision_model(pixel_values=inputs.pixel_values)
    text_outputs = model.text_model(input_ids=inputs.input_ids)
    # project both pooled outputs into the shared 768-dim space
    img_embeds = model.visual_projection(vision_outputs[1])
    txt_embeds = model.text_projection(text_outputs[1])
    # L2-normalize so the dot product becomes a cosine similarity
    img_embeds = img_embeds / img_embeds.norm(p=2, dim=-1, keepdim=True)
    txt_embeds = txt_embeds / txt_embeds.norm(p=2, dim=-1, keepdim=True)
img_embeds.shape, txt_embeds.shape

(torch.Size([1, 768]), torch.Size([1, 768]))
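Since both embeddings are already L2-normalized, the similarity score is just their dot product; a self-contained sketch with toy stand-ins for the projected embeddings above:

```python
import torch

# Toy stand-ins for the projected, L2-normalized embeddings above.
img_embeds = torch.nn.functional.normalize(torch.randn(1, 768), dim=-1)
txt_embeds = torch.nn.functional.normalize(torch.randn(1, 768), dim=-1)

# Cosine similarity of unit vectors is just their dot product.
similarity = (img_embeds * txt_embeds).sum(dim=-1)
print(similarity)  # shape (1,), values in [-1, 1]
```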

I was using transformers ‘4.23.1’…

Ok, I upgraded, and model.vision_model.embeddings and model.text_model.embeddings are not the same as the projection embeddings. I guess I had understood it wrongly, but anyway, I can do what I wanted with the projection.
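Concretely, the projection layers are what align the two towers; a quick check (again assuming the openai/clip-vit-large-patch14 checkpoint) shows the shapes:

```python
from transformers import CLIPModel

# Assumption: openai/clip-vit-large-patch14, the checkpoint whose
# text encoder Stable Diffusion v1 uses.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# The vision tower's hidden size is 1024 and the text tower's is 768;
# both projections map into the shared 768-dim embedding space.
print(model.visual_projection)  # Linear(in_features=1024, out_features=768, bias=False)
print(model.text_projection)    # Linear(in_features=768, out_features=768, bias=False)
```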


Oh yes, you are right, you need the projection embeddings 🙂


Great that you solved it @fredguth! How did you initialize a working CLIP pipeline from the components of StableDiffusion?