I have fine-tuned openai/clip-vit-large-patch14 → https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/
I finally figured out that you need the metadata in the model file and so on, and it ‘works’ (as in, my model loads):
from transformers import CLIPProcessor, CLIPModel
model_id = "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
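For reference, the similarity numbers in this post come from a check essentially like this, end to end (simplified sketch; the image filename is a placeholder, and model_id is swapped for the baseline comparison):

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model_id = "zer0int/CLIP-GmP-ViT-L-14"  # or "openai/clip-vit-large-patch14" for the baseline numbers
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")  # placeholder for my test image
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# unit-normalize, then cosine similarity is just the dot product
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
for text, sim in zip(texts, (image_emb @ text_emb.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{text}'): {sim}")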
But, comparing:
CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
Cosine similarity (image vs 'A photo of a cat'): 0.2330581396818161
Cosine similarity (image vs 'A picture of a dog'): 0.15255104005336761
Cosine similarity (image vs 'cat'): 0.21000739932060242
Cosine similarity (image vs 'dog'): 0.14514459669589996
I have re-converted the original model to a HuggingFace model.safetensors file from my original torch.save pickle file (fine-tuned with “import clip”), using the original OpenAI/CLIP as a ‘donor’ for the missing ‘position_ids’ as well as for ‘syntax inspiration’. All keys match. logit_scale matches. Still, when I load my model, I always get something along the lines of:
CLIPModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")
Cosine similarity (image vs 'A photo of a cat'): 0.05461934581398964
Cosine similarity (image vs 'A picture of a dog'): 0.030599746853113174
Cosine similarity (image vs 'cat'): -0.0010263863950967789
Cosine similarity (image vs 'dog'): 0.004391679540276527
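Regarding “all keys match” above: I mean a check roughly along these lines (a small sketch, comparing my uploaded file against the .safetensors of the original HF repo):

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

mine = load_file(hf_hub_download("zer0int/CLIP-GmP-ViT-L-14", "model.safetensors"))
ref = load_file(hf_hub_download("openai/clip-vit-large-patch14", "model.safetensors"))

# same key set and same shapes as the reference file
print(set(mine.keys()) == set(ref.keys()))
print([k for k in ref if k in mine and mine[k].shape != ref[k].shape])  # expect []

# and the logit_scale carried over from my checkpoint
print(mine["logit_scale"].item())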
I have no idea what is going on with those messed-up cosine similarities. I am obviously doing something wrong.
And, to make sure, I am loading the exact same model file I used for the conversion to .safetensors - but this time as the PyTorch pickle .pt file. The exact same model, just not in the Hugging Face format. And I get:
Cosine similarity (image vs 'A photo of a cat'): 0.2086181640625
Cosine similarity (image vs 'A picture of a dog'): 0.08636474609375
Cosine similarity (image vs 'cat'): 0.1849365234375
Cosine similarity (image vs 'dog'): 0.0947265625
This is absolutely as expected. Slightly less confident than the original CLIP about this being a “cat” - but absolutely SUPER confident that this is NOT a dog.
That re-organization of embeddings is why my model outperforms the original one.
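For reference, the pickle-based check behind the numbers above is essentially this (sketch; the filenames are placeholders, and it assumes the .pt holds the full model object that was saved with torch.save):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# "ViT-L/14" is only loaded here to get the matching preprocess transform
_, preprocess = clip.load("ViT-L/14", device=device)

model = torch.load("my_finetuned_clip.pt", map_location=device)  # placeholder filename
if device == "cpu":
    model = model.float()  # CLIP pickles are often fp16; avoid half-precision ops on CPU
model.eval()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # placeholder image
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]
tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(tokens)

image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
for text, sim in zip(texts, (image_emb @ text_emb.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{text}'): {sim}")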
No idea what I am doing wrong.
I just “stole” the original config etc. from “openai/clip-vit-large-patch14”. I did NOT change / re-train the tokenizer. My model is just a ‘normal’ CLIP ViT-L/14, fine-tuned.
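By “stole the original config etc.” I mean roughly this (sketch; the local folder name is a placeholder):

from transformers import CLIPConfig, CLIPProcessor

# copy config.json plus the tokenizer/preprocessor files from the original repo
# into my local model folder, unchanged
CLIPConfig.from_pretrained("openai/clip-vit-large-patch14").save_pretrained("CLIP-GmP-ViT-L-14")
CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14").save_pretrained("CLIP-GmP-ViT-L-14")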
I saw there was an “SFconvertbot” that apparently created the .safetensors file for OpenAI’s original model. Do I just have to upload my model separately, as a pickle (.bin?), and your bot will come by and fix this? =)
Any help is appreciated - from bot or human alike! Thank you!