I was using the pretrained model laion/CLIP-ViT-B-32-laion2B-s34B-b79K with OpenCLIP, as the Hugging Face model card suggests:
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K')
This gives an image embedding of size 512.
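For reference, this is roughly how I pull the 512-dimensional image embedding out of the open_clip model (the image path below is just a placeholder):
import torch
from PIL import Image

image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)  # placeholder path, any RGB image works
with torch.no_grad():
    image_features = model.encode_image(image)
print(image_features.shape)  # torch.Size([1, 512])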
I would like to use HF’s own CLIPVisionModel and obtain the same result, i.e. an embedding of size 512. I tried setting projection_dim=512 via CLIPVisionConfig, but I am still getting an embedding of size 768:
from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModel, CLIPVisionConfig
configuration = CLIPVisionConfig(projection_dim=512)
model = CLIPVisionModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", config=configuration)
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output # pooled CLS states
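# both outputs are 768-dimensional (the ViT hidden size), not the 512 I want
print(last_hidden_state.shape)  # torch.Size([1, 50, 768])
print(pooled_output.shape)      # torch.Size([1, 768])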
Could anyone help me get the same 512-dimensional embedding from the Transformers side?
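My guess from the docs is that CLIPVisionModel has no projection head at all (so the projection_dim I set is simply ignored), and that I need either CLIPVisionModelWithProjection or CLIPModel.get_image_features to get the projected 512-dimensional embedding. Is something like this untested sketch the right approach?
import torch
from transformers import CLIPVisionModelWithProjection

proj_model = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
with torch.no_grad():
    proj_outputs = proj_model(**inputs)  # reusing `inputs` from the processor call above
image_embeds = proj_outputs.image_embeds
print(image_embeds.shape)  # hoping for torch.Size([1, 512])?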