StableDiffusionPipeline uses a
CLIPTextModel for the text embeddings. Can I match that with the corresponding
CLIPVisionModel to get CLIP text-image similarity of generated images? Which pretrained
CLIPVisionModel should I use?
This is exactly what I am trying to do.
But I am blocked by the fact that the CLIPVisionModel embedding output has a different size than I expected.
In this minimal working example you can see that even though
CLIPVisionModel.config.projection_dim == CLIPTextModel.config.projection_dim == 768,
the embedding modules still have different shapes:
(patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
(position_embedding): Embedding(257, 1024)
(token_embedding): Embedding(49408, 768)
(position_embedding): Embedding(77, 768)
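(For context: the modules printed above live inside the two transformers, whose hidden sizes differ; projection_dim only describes the output of the final projection heads. A minimal sketch in plain torch, with dimensions hand-copied from CLIP ViT-L/14 rather than loaded from a checkpoint:)

```python
import torch

# The vision transformer runs at width 1024 and the text transformer at
# width 768; only the projection heads map both into the shared
# projection_dim == 768 embedding space.
visual_projection = torch.nn.Linear(1024, 768, bias=False)
text_projection = torch.nn.Linear(768, 768, bias=False)

pooled_image = torch.randn(1, 1024)  # stands in for vision_model pooler_output
pooled_text = torch.randn(1, 768)    # stands in for text_model pooler_output

img_embed = visual_projection(pooled_image)  # shape (1, 768)
txt_embed = text_projection(pooled_text)     # shape (1, 768)
```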
I am confused about this and I can’t find much more in the documentation.
You can compute the similarity score between text and image by taking the dot product of the corresponding (normalized) embeddings. To do so, you should use the CLIP vision model that matches the one used to produce the text embeddings. See for example this snippet: transformers/modeling_clip.py at v4.24.0 · huggingface/transformers · GitHub. If you invoke that function with multiple images and texts,
logits_per_image gives you, for each image, its score against every text prompt, and
logits_per_text is the transpose: for each text, its score against every image. You can use one image and one prompt and get a score that is comparable across other image-text pairs.
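(A toy sketch of those two outputs, with random tensors standing in for real CLIP features; the logit scale of 100 is an assumption matching CLIP's converged temperature:)

```python
import torch
import torch.nn.functional as F

# 2 images and 3 text prompts, 768-dim like CLIP ViT-L/14 embeddings
torch.manual_seed(0)
image_embeds = F.normalize(torch.randn(2, 768), dim=-1)
text_embeds = F.normalize(torch.randn(3, 768), dim=-1)

logit_scale = 100.0  # exp of CLIP's learned temperature parameter
logits_per_text = logit_scale * text_embeds @ image_embeds.t()  # shape (3, 2)
logits_per_image = logits_per_text.t()                          # shape (2, 3)
```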
There’s a usage example here: CLIP, which could be a good starting point.
This is what I was looking for:
vision_outputs = model.vision_model(pixel_values=inputs.pixel_values)
text_outputs = model.text_model(input_ids=inputs.input_ids)
# project the pooled outputs into the shared 768-dim embedding space
img_embeds = model.visual_projection(vision_outputs.pooler_output)
txt_embeds = model.text_projection(text_outputs.pooler_output)
# L2-normalize so the dot product gives cosine similarity
img_embeds = img_embeds / img_embeds.norm(p=2, dim=-1, keepdim=True)
txt_embeds = txt_embeds / txt_embeds.norm(p=2, dim=-1, keepdim=True)
(torch.Size([1, 768]), torch.Size([1, 768]))
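(Once both embeddings are normalized to shape (1, 768), the score for a single image-text pair is just their dot product; random tensors stand in for the real embeddings here:)

```python
import torch
import torch.nn.functional as F

img_embeds = F.normalize(torch.randn(1, 768), dim=-1)
txt_embeds = F.normalize(torch.randn(1, 768), dim=-1)

# cosine similarity in [-1, 1]; higher means a better text-image match
score = (img_embeds * txt_embeds).sum(dim=-1)
```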
I was using
Ok, I upgraded and
model.vision_model.embeddings and model.text_model.embeddings are not the same as the projection embeddings. I guess I understood it wrongly, but anyway, I can do what I wanted with the projection.
Oh yes, you are right, you need the projection embeddings
Great that you solved it @fredguth! How did you initialize a working CLIP pipeline from the components of StableDiffusion?
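(One possible sketch, assuming Stable Diffusion v1.x: its text encoder is taken from openai/clip-vit-large-patch14, so loading that full checkpoint gives a vision tower and projection heads that match the pipeline's text embeddings:)

```python
from transformers import CLIPModel, CLIPProcessor

# Assumption: Stable Diffusion v1.x, whose text encoder comes from
# openai/clip-vit-large-patch14; the full CLIPModel bundles the matching
# vision tower plus both projection heads.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```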