How to obtain correct text embeddings from CLIP?

I am trying to obtain text embeddings from CLIP as shown below, but I am confused about the difference between text_embeds and pooler_output. According to the documentation, text_embeds is “the text embeddings obtained by applying the projection layer to the pooler_output”, but I am not sure what this means. Are both acceptable as text embeddings if I want to compare text similarity, or is one more correct than the other?

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

url = ""
image = Image.open(requests.get(url, stream=True).raw)

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
text_embeds = outputs['text_embeds']
pooler_output = outputs['text_model_output']['pooler_output']

Answered here: "Obtaining text embeddings from CLIP", Issue #21465 in huggingface/transformers on GitHub.
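
For anyone wondering what "applying the projection layer to the pooler_output" amounts to, here is a minimal sketch in plain Python with made-up numbers. It is not the library's code: the helper names (project, l2_normalize, cosine_similarity) and the tiny 3-to-2 matrix are assumptions for illustration only. As far as I understand it, in CLIPModel the projection is a bias-free linear layer (text_projection) that maps the pooled text vector into the space shared with the image embeddings, and the text_embeds returned by the forward pass are additionally L2-normalized.

```python
import math


def project(pooled, weight):
    """Bias-free linear projection: one output value per row of `weight`."""
    return [sum(w * x for w, x in zip(row, pooled)) for row in weight]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Made-up stand-ins for two pooler_output vectors (not real CLIP values).
pooled_cat = [1.0, 0.0, 2.0]
pooled_dog = [0.0, 1.0, 2.0]

# Hypothetical projection matrix mapping 3 dimensions to 2.
toy_projection = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

# Stand-ins for text_embeds: projected, then normalized.
embed_cat = l2_normalize(project(pooled_cat, toy_projection))
embed_dog = l2_normalize(project(pooled_dog, toy_projection))

# Similarity measured in the pooled space vs. the projected space.
sim_pooled = cosine_similarity(pooled_cat, pooled_dog)
sim_projected = cosine_similarity(embed_cat, embed_dog)

print(f"pooled space:    {sim_pooled:.2f}")
print(f"projected space: {sim_projected:.2f}")
```

The point of the toy example is that the two spaces can give different similarity scores for the same pair of texts, so vectors from one should not be compared against vectors from the other. If the text embeddings also need to be compared with image embeddings, the projected ones (text_embeds) are the ones that live in the shared space.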