How to get a 512-dimensional embedding from HF CLIP, matching open_clip?

I was using the pretrained model laion/CLIP-ViT-B-32-laion2B-s34B-b79K with open_clip, as HF suggests:

import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K')

This gives an embedding of size 512.

I would like to use HF’s own CLIPVisionModel and obtain the same result, that is, an embedding of size 512. I’m trying this way with CLIPVisionConfig but I’m still getting an embedding of 768:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModel, CLIPVisionConfig

configuration = CLIPVisionConfig(projection_dim=512)

model = CLIPVisionModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", config=configuration)
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")

url = ""
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled CLS states

Could anyone help me get the same result?


Yes, for that you need to load the CLIPVisionModelWithProjection class:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")

url = ""
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
image_embeds = outputs.image_embeds

This class includes the projection layer (which projects the image embeddings into the same embedding space as the text embeddings).
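
To make the size difference concrete, here is a rough sketch in plain Python (no transformers needed) of what a projection layer does to the shapes. The sizes 768 and 512 are the ones from this thread. The helper name `project` and the random weights are made up for illustration and are not the model's real weights, and the sketch assumes the projection is a plain linear map without a bias term.

```python
import random

HIDDEN_SIZE = 768      # size of pooler_output reported in this thread
PROJECTION_DIM = 512   # size of the shared image/text embedding space


def project(pooled, weight):
    """Bias-free linear projection: out[j] = sum_i weight[j][i] * pooled[i]."""
    return [sum(w * x for w, x in zip(row, pooled)) for row in weight]


rng = random.Random(0)

# Stand-in for outputs.pooler_output of one image (random values, not real features).
pooled_output = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]

# Stand-in for the learned projection weights, shape (PROJECTION_DIM, HIDDEN_SIZE).
projection_weight = [
    [rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
    for _ in range(PROJECTION_DIM)
]

# Stand-in for outputs.image_embeds: the pooled vector mapped into the shared space.
image_embeds = project(pooled_output, projection_weight)

print(len(pooled_output), len(image_embeds))  # 768 512
```

So CLIPVisionModel stops at the 768-sized pooled output, while CLIPVisionModelWithProjection applies this extra step and returns the 512-sized image_embeds.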

It worked! Thank you very much, Niels!

