I am trying to obtain text embeddings from CLIP as shown below. However, I am confused about the difference between text_embeds and pooler_output. According to the documentation, text_embeds is “the text embeddings obtained by applying the projection layer to the pooler_output”, but I am not sure what that means. Are both acceptable to use as text embeddings (say, if I want to compare text similarity), or is one more correct than the other?
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

# Load an example COCO image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Forward pass over both text prompts and the image
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# The two outputs I am unsure about:
text_embeds = outputs['text_embeds']
pooler_output = outputs['text_model_output']['pooler_output']
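To check my own reading of the docs, I put together the small sketch below. It assumes model.text_projection is the projection layer the documentation refers to (that attribute name is my guess from skimming the model source, so I may be wrong):

import torch

# Assumption: model.text_projection is the projection layer from the docs.
# If so, projecting pooler_output should reproduce text_embeds.
with torch.no_grad():
    projected = model.text_projection(pooler_output)
print(torch.allclose(projected, text_embeds))  # expected: True, if my reading is right

# For similarity comparisons, CLIP normalizes the projected embeddings, so I
# would L2-normalize text_embeds and take cosine similarity between rows:
normed = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(normed @ normed.T)  # pairwise cosine similarities between the two prompts

Is that the intended way to use these outputs, or should similarity be computed on pooler_output instead?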