I was using the pretrained model laion/CLIP-ViT-B-32-laion2B-s34B-b79K with OpenCLIP, as the Hugging Face model card suggests:
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K')
This gives an image embedding of size 512.
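For reference, this is roughly how I pull the 512-dimensional image embedding out of the open_clip model (the image path below is just a placeholder):
import torch
from PIL import Image

image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)  # placeholder path, any RGB image works
with torch.no_grad():
    image_features = model.encode_image(image)
print(image_features.shape)  # torch.Size([1, 512])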
I would like to use HF’s own CLIPVisionModel and obtain the same result, i.e. an embedding of size 512. I tried setting projection_dim=512 via CLIPVisionConfig, but I am still getting an embedding of size 768:
from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModel, CLIPVisionConfig
configuration = CLIPVisionConfig(projection_dim=512)
model = CLIPVisionModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", config=configuration)
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output # pooled CLS states
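# both outputs are 768-dimensional (the ViT hidden size), not the 512 I want
print(last_hidden_state.shape)  # torch.Size([1, 50, 768])
print(pooled_output.shape)      # torch.Size([1, 768])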
Could anyone help me get the same 512-dimensional embedding from the Transformers side?
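My guess from the docs is that CLIPVisionModel has no projection head at all (so the projection_dim I set is simply ignored), and that I need either CLIPVisionModelWithProjection or CLIPModel.get_image_features to get the projected 512-dimensional embedding. Is something like this untested sketch the right approach?
import torch
from transformers import CLIPVisionModelWithProjection

proj_model = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
with torch.no_grad():
    proj_outputs = proj_model(**inputs)  # reusing `inputs` from the processor call above
image_embeds = proj_outputs.image_embeds
print(image_embeds.shape)  # hoping for torch.Size([1, 512])?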