Stable Diffusion CLIP similarity

Hi,
The StableDiffusionPipeline uses a CLIPTextModel for the text embeddings. Can I pair it with the corresponding CLIPVisionModel to get the CLIP text-image similarity of generated images? Which pretrained CLIPVisionModel should I use?

Thanks!

This is exactly what I am trying to do.

But I am blocked by the fact that the CLIPVisionModel embedding output has a different size than I expected.

In this minimal working example you can see that, despite CLIPVisionModel.config.projection_dim == CLIPTextModel.config.projection_dim == 768, the embeddings still have different shapes:

CLIPVisionEmbeddings(
  (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
  (position_embedding): Embedding(257, 1024)
)
CLIPTextEmbeddings(
  (token_embedding): Embedding(49408, 768)
  (position_embedding): Embedding(77, 768)
)
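
A minimal sketch that reproduces those module printouts (assuming the openai/clip-vit-large-patch14 checkpoint, which matches the Stable Diffusion v1.x text encoder):

from transformers import CLIPTextModel, CLIPVisionModel

# Assumed checkpoint, for illustration only
name = "openai/clip-vit-large-patch14"
text_model = CLIPTextModel.from_pretrained(name)
vision_model = CLIPVisionModel.from_pretrained(name)

# projection_dim is 768 for both configs, yet the embedding modules operate
# in different hidden sizes (1024 for vision, 768 for text):
print(vision_model.vision_model.embeddings)
print(text_model.text_model.embeddings)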

I am confused about this and I can’t find much more in the documentation.

Hello!

You can compute the similarity score between a text and an image by taking the dot product of their (normalized) embeddings. To do so, you should use the CLIP vision model that corresponds to the CLIP text model used for the text embeddings. See for example this snippet: transformers/modeling_clip.py at v4.24.0 · huggingface/transformers · GitHub. If you invoke that forward pass with multiple images and texts, logits_per_image gives you, for each image, its score against every text prompt, and logits_per_text is the transpose: for each text prompt, its score against every image. You can also pass a single image and a single prompt and get a score that is comparable across other image-text pairs.

There’s a usage example in the CLIP docs that could be a good starting point :slight_smile:
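
Putting that together, a minimal sketch (the checkpoint name, prompts, and image path below are assumptions for illustration):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint: ViT-L/14, matching the Stable Diffusion v1.x text encoder
name = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

image = Image.open("generated.png")  # hypothetical generated image
texts = ["an astronaut riding a horse", "a bowl of fruit"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits_per_image)  # rows: images, columns: texts -> shape (1, 2)
print(outputs.logits_per_text)   # transpose                    -> shape (2, 1)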


Thanks!
This is what I was looking for:

# `model` is a CLIPModel and `inputs` comes from a CLIPProcessor, e.g. (assumed setup):
#   model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
#   processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
#   inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    vision_outputs = model.vision_model(pixel_values=inputs.pixel_values)
    text_outputs = model.text_model(input_ids=inputs.input_ids)
    # pooled outputs (index 1), projected into the shared embedding space
    img_embeds = model.visual_projection(vision_outputs[1])
    txt_embeds = model.text_projection(text_outputs[1])
    # L2-normalize so the dot product gives the cosine similarity
    img_embeds = img_embeds / img_embeds.norm(p=2, dim=-1, keepdim=True)
    txt_embeds = txt_embeds / txt_embeds.norm(p=2, dim=-1, keepdim=True)
img_embeds.shape, txt_embeds.shape

(torch.Size([1, 768]), torch.Size([1, 768]))
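
From there, the text-image similarity is just the dot product of the two normalized embeddings, something along these lines:

# img_embeds and txt_embeds are already L2-normalized above,
# so their dot product is the cosine similarity
similarity = (img_embeds * txt_embeds).sum(dim=-1)
print(similarity.item())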

I was using transformers ‘4.23.1’…

OK, I upgraded, and model.vision_model.embeddings / model.text_model.embeddings are not the same as the projection embeddings. I guess I understood it wrongly, but anyway, I can do what I wanted with the projections.


Oh yes, you are right, you need the projection embeddings :slight_smile:


Great that you solved it @fredguth! How did you initialize a working CLIP pipeline from the components of StableDiffusion?

Thanks!