I am trying to obtain text embeddings from CLIP as shown below. However, I am confused about the difference between text_embeds and pooler_output. According to the documentation, text_embeds is “the text embeddings obtained by applying the projection layer to the pooler_output”, but I am not sure what that means. Are both acceptable to use as text embeddings (say, if I want to compare text similarity), or is one more correct than the other?
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

# Load an example COCO image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Forward pass over both text prompts and the image
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# The two outputs I am unsure about:
text_embeds = outputs['text_embeds']
pooler_output = outputs['text_model_output']['pooler_output']
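To check my own reading of the docs, I put together the small sketch below. It assumes model.text_projection is the projection layer the documentation refers to (that attribute name is my guess from skimming the model source, so I may be wrong):

import torch

# Assumption: model.text_projection is the projection layer from the docs.
# If so, projecting pooler_output should reproduce text_embeds.
with torch.no_grad():
    projected = model.text_projection(pooler_output)
print(torch.allclose(projected, text_embeds))  # expected: True, if my reading is right

# For similarity comparisons, CLIP normalizes the projected embeddings, so I
# would L2-normalize text_embeds and take cosine similarity between rows:
normed = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(normed @ normed.T)  # pairwise cosine similarities between the two prompts

Is that the intended way to use these outputs, or should similarity be computed on pooler_output instead?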