Hi,
Yes, for that you need to load the CLIPVisionModelWithProjection class:
from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModelWithProjection

# Load the vision tower (with its projection head) and the matching processor
model = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")

# Download a sample image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image and run it through the model
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Projected image embeddings, shape (batch_size, projection_dim)
image_embeds = outputs.image_embeds
This class includes the projection layer, which maps the pooled image features into the same embedding space as the text embeddings, so image_embeds can be compared directly with projected text embeddings.
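In case it's useful, here is a minimal sketch of what the shared embedding space buys you: image and text vectors can be scored against each other with cosine similarity. The helper name `cosine_similarity` and the three toy vectors are made up for illustration. In practice you would pass something like `image_embeds[0].tolist()` together with the `text_embeds` output of `CLIPTextModelWithProjection` loaded from the same checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings (hypothetical values, not model output)
image_vec = [0.2, 0.9, 0.1]
caption_a = [0.1, 0.8, 0.2]  # points in roughly the same direction as image_vec
caption_b = [0.9, 0.1, 0.0]  # points in a different direction

score_a = cosine_similarity(image_vec, caption_a)
score_b = cosine_similarity(image_vec, caption_b)
print(score_a > score_b)  # the closer caption scores higher
```

With real embeddings, the caption with the highest score is the one the model considers the best match for the image.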