StableDiffusionPipeline uses a
CLIPTextModel for the text embeddings. Can I match that with the corresponding
CLIPVisionModel to get CLIP text-image similarity of generated images? Which pretrained
CLIPVisionModel should I use?
This is exactly what I am trying to do.
But I am blocked by the fact that the CLIPVisionModel embedding output has a different size than I expected.
In this minimal working example you can see that even though
CLIPVisionModel.config.projection_dim == CLIPTextModel.config.projection_dim == 768,
the embedding modules still have different shapes:
(patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
(position_embedding): Embedding(257, 1024)
(token_embedding): Embedding(49408, 768)
(position_embedding): Embedding(77, 768)
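(For context: the modules printed above live inside the two transformers, whose hidden sizes differ; projection_dim only describes the output of the final projection heads. A minimal sketch in plain torch, with dimensions hand-copied from CLIP ViT-L/14 rather than loaded from a checkpoint:)

```python
import torch

# The vision transformer runs at width 1024 and the text transformer at
# width 768; only the projection heads map both into the shared
# projection_dim == 768 embedding space.
visual_projection = torch.nn.Linear(1024, 768, bias=False)
text_projection = torch.nn.Linear(768, 768, bias=False)

pooled_image = torch.randn(1, 1024)  # stands in for vision_model pooler_output
pooled_text = torch.randn(1, 768)    # stands in for text_model pooler_output

img_embed = visual_projection(pooled_image)  # shape (1, 768)
txt_embed = text_projection(pooled_text)     # shape (1, 768)
```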
I am confused about this and I can’t find much more in the documentation.
You can compute the similarity score between text and image by taking the dot product of the corresponding (normalized) embeddings. To do so, you should use the CLIP vision model that matches the one used to produce the text embeddings. See for example this snippet: transformers/modeling_clip.py at v4.24.0 · huggingface/transformers · GitHub. If you invoke that function with multiple images and texts,
logits_per_image gives you, for each image, its score against every text prompt, and
logits_per_text is the transpose: for each text, its score against every image. You can use one image and one prompt and get a score that is comparable across other image-text pairs.
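(A toy sketch of those two outputs, with random tensors standing in for real CLIP features; the logit scale of 100 is an assumption matching CLIP's converged temperature:)

```python
import torch
import torch.nn.functional as F

# 2 images and 3 text prompts, 768-dim like CLIP ViT-L/14 embeddings
torch.manual_seed(0)
image_embeds = F.normalize(torch.randn(2, 768), dim=-1)
text_embeds = F.normalize(torch.randn(3, 768), dim=-1)

logit_scale = 100.0  # exp of CLIP's learned temperature parameter
logits_per_text = logit_scale * text_embeds @ image_embeds.t()  # shape (3, 2)
logits_per_image = logits_per_text.t()                          # shape (2, 3)
```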
There’s a usage example here: CLIP, which could be a good starting point.
This is what I was looking for:
vision_outputs = model.vision_model(pixel_values=inputs.pixel_values)
text_outputs = model.text_model(input_ids=inputs.input_ids)
# project the pooled outputs into the shared 768-dim embedding space
img_embeds = model.visual_projection(vision_outputs.pooler_output)
txt_embeds = model.text_projection(text_outputs.pooler_output)
# L2-normalize so the dot product gives cosine similarity
img_embeds = img_embeds / img_embeds.norm(p=2, dim=-1, keepdim=True)
txt_embeds = txt_embeds / txt_embeds.norm(p=2, dim=-1, keepdim=True)
(torch.Size([1, 768]), torch.Size([1, 768]))
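(Once both embeddings are normalized to shape (1, 768), the score for a single image-text pair is just their dot product; random tensors stand in for the real embeddings here:)

```python
import torch
import torch.nn.functional as F

img_embeds = F.normalize(torch.randn(1, 768), dim=-1)
txt_embeds = F.normalize(torch.randn(1, 768), dim=-1)

# cosine similarity in [-1, 1]; higher means a better text-image match
score = (img_embeds * txt_embeds).sum(dim=-1)
```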
I was using
Ok, I upgraded and
model.vision_model.embeddings and model.text_model.embeddings are not the same as the projection embeddings. I guess I understood it wrongly, but anyway, I can do what I wanted with the projection.
Oh yes, you are right, you need the projection embeddings
Great that you solved it @fredguth! How did you initialize a working CLIP pipeline from the components of StableDiffusion?
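(One possible sketch, assuming Stable Diffusion v1.x: its text encoder is taken from openai/clip-vit-large-patch14, so loading that full checkpoint gives a vision tower and projection heads that match the pipeline's text embeddings:)

```python
from transformers import CLIPModel, CLIPProcessor

# Assumption: Stable Diffusion v1.x, whose text encoder comes from
# openai/clip-vit-large-patch14; the full CLIPModel bundles the matching
# vision tower plus both projection heads.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```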