Converting weights to .safetensors with HF format -> CLIP-L is ruined. Why?

I have fine-tuned openai/clip-vit-large-patch14 → https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/

I finally figured out you need metadata in the model and all, and it ‘works’ (as in, my model loads):

from transformers import CLIPProcessor, CLIPModel

model_id = "zer0int/CLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

But, comparing:

CLIPModel.from_pretrained("openai/clip-vit-large-patch14")


Cosine similarity (image vs 'A photo of a cat'): 0.2330581396818161
Cosine similarity (image vs 'A picture of a dog'): 0.15255104005336761
Cosine similarity (image vs 'cat'): 0.21000739932060242
Cosine similarity (image vs 'dog'): 0.14514459669589996
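For reference, those numbers come from something like this quick check - a sketch; the test image is a placeholder:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")  # placeholder test image
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# L2-normalize the projected embeddings, then cosine similarity is just a dot product
img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
for t, sim in zip(texts, (img @ txt.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{t}'): {sim}")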

I have re-converted the original model to a HuggingFace model.safetensors from my original torch.save pickle file (fine-tuned with “import clip”), using the original OpenAI/CLIP as a ‘donor’ for the missing ‘position_ids’ as well as for ‘syntax inspiration’ (see the sketch below). All keys match. logit_scale matches. Still, when I load my model, I always get something along the lines of:

CLIPModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")

Image vs 'A photo of a cat': 0.05461934581398964
Image vs 'A picture of a dog': 0.030599746853113174
Image vs 'cat': -0.0010263863950967789
Image vs 'dog': 0.004391679540276527

I have no idea what is going on with that messed up cosine similarity. I am obviously doing something wrong.
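For reference, the ‘donor’ step for the missing position_ids was roughly this (a sketch; file names are placeholders):

import torch
from transformers import CLIPModel

# Reference HF model used as 'donor' for buffers missing from my converted state_dict
donor = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").state_dict()
converted = torch.load("converted_hf_state_dict.pt")  # placeholder: my converted weights

for k in ("text_model.embeddings.position_ids", "vision_model.embeddings.position_ids"):
    if k not in converted and k in donor:
        converted[k] = donor[k].clone()

# Sanity checks: same key set, same logit_scale
print(set(converted.keys()) == set(donor.keys()))
print(converted["logit_scale"], donor["logit_scale"])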

And, to make sure, I am loading the exact same model file I used for the conversion to .safetensors - but this time as the PyTorch pickle .pt file. So it’s the exact same model, just not in HuggingFace format. And I get:

Cosine similarity (image vs 'A photo of a cat'): 0.2086181640625
Cosine similarity (image vs 'A picture of a dog'): 0.08636474609375
Cosine similarity (image vs 'cat'): 0.1849365234375
Cosine similarity (image vs 'dog'): 0.0947265625

This is absolutely as expected. Slightly less confident than original CLIP about this being a “cat” - but absolutely SUPER confident that this is NOT a dog.
That re-organization of embeddings is why my model outperforms the original one.
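That check, for completeness, is just plain “import clip” loading of the pickle (sketch; the test image is a placeholder):

import clip
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("full-ViT-L-ft.pt", device=device)  # the fine-tuned pickle

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]
tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)

for t, sim in zip(texts, (img @ txt.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{t}'): {sim}")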

No idea what I am doing wrong. :hugs:
I just “stole” the original config etc. from “openai/clip-vit-large-patch14”. I did NOT change / re-train the tokenizer. My model is just a ‘normal’ CLIP ViT-L/14, fine-tuned.

I saw there was a “SFconvertbot” that apparently created the .safetensors file for OpenAI’s original model. Do I just have to upload one model separately, as a pickle (.bin?), and your bot will come by and fix this? =)

Any help is appreciated - from bot or human alike! Thank you!

I don’t know if this is related, but if, for example, adjustments were made in ComfyUI, the key names may have been changed along the way, as shown below.
Other than that, there should be no structural difference between Diffusers and the other components, aside from the UNET and VAE…

text_model.encoder.layers.0.layer_norm2.weight => text_encoders.clip_l.transformer.text_model.encoder.layers.0.layer_norm2.weight

Thank you for your response!
Yes, fortunately it seems like just the Text Encoder of CLIP works fine as-is in HuggingFace safetensors format. ComfyUI also handles a state_dict.pt in the original OpenAI “import clip” format (naming) and converts it appropriately, so it can take either .safetensors or any pickle format just fine - and it seems to produce the same results. Forge and some of the gguf / quantization things people were exploring apparently need everything to be in the HF format, I was told (I only use ComfyUI myself).

So, it seems like it’s a specific issue with trying to make my full model - text and vision - work with the transformers library. I must be doing something wrong during the conversion, and I can’t figure out what it is. :upside_down_face:

AI-related things are advancing so fast that the specifications are not set in stone; everything is effectively a proprietary format, so this compatibility issue is worth looking into.
There were no errors reading or writing these safetensors files, and since the model can be loaded with from_pretrained and saved with save_pretrained, there shouldn’t be any formatting inconsistencies in the CLIP model - that process fails quite loudly when something is off.
There were also no unwanted extras added when the model was packaged in ComfyUI.
If any parts were converted manually, it’s possible a bug crept in there, but I don’t think that alone would be enough to cause a malfunction like this…

Now all we can do is to play the game of looking for mistakes…:sweat_smile:

FLUX Original CLIP keys

CLIPs from your repo keys

Test Script

import torch
from safetensors.torch import load_file, save_file
from pathlib import Path

def normalize_key(k: str):
    return k.replace("vae.", "").replace("model.diffusion_model.", "")\
        .replace("text_encoders.clip_l.transformer.", "")\
        .replace("text_encoders.t5xxl.transformer.", "")

filename = "model_original.safetensors"
savename = Path(filename).stem + "_fixed" + Path(filename).suffix
oldlogname = Path(filename).stem + "_fixed" + ".old_keys.txt"
newlogname = Path(filename).stem + "_fixed" + ".new_keys.txt"

state_dict = load_file(filename)
new_sd = dict()

keys_old = []
keys_new = []
with torch.no_grad():
    try:
        for k, v in state_dict.items():
            nkey = normalize_key(k)
            print(f"{k} => {nkey}")
            keys_old.append(k)
            keys_new.append(nkey)
            new_sd[nkey] = v  # store under the normalized key
    except Exception as e:
        print(e)

#save_file(new_sd, savename)

with open(oldlogname, encoding='utf-8', mode='w') as f:
    f.write("\n".join(keys_old))
with open(newlogname, encoding='utf-8', mode='w') as f:
    f.write("\n".join(keys_new))

Speaking of which, do you think this should eventually work as a stand-alone CLIP text model?
Or one that is integrated into SDXL or Flux?

P.S.

If we can load it with CLIPTextModel, we can handle the rest.
But the loading itself already works in the first place… it’s just that the behavior is weird. :nauseated_face:
I looked at the create_diffusers_clip_model_from_ldm function, but it just discards unused parts of the CLIP model and fixes prefixes.

from diffusers.loaders.single_file_utils import create_diffusers_clip_model_from_ldm
from transformers import CLIPTextModel

# CLIP (sd = the checkpoint state_dict loaded beforehand, e.g. via safetensors load_file)
clip_ = create_diffusers_clip_model_from_ldm(
    cls=CLIPTextModel, checkpoint=sd, config="black-forest-labs/FLUX.1-dev", subfolder="text_encoder"
)

P.S.

Maybe we have to use CLIPTextModel instead of CLIPModel for compatibility?
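Something along these lines, perhaps (just a sketch; the file name is a placeholder, and it assumes the keys already use the HF text_model.* naming):

import torch
from transformers import CLIPTextModel, CLIPTextConfig

# Build an empty text-only CLIP and load only the text_model.* weights into it
config = CLIPTextConfig.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel(config)

full_sd = torch.load("yourmodel.pth")  # placeholder: full CLIP state_dict
text_sd = {k: v for k, v in full_sd.items() if k.startswith("text_model.")}

missing, unexpected = text_model.load_state_dict(text_sd, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
text_model.save_pretrained("clip_l_text_encoder")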

Using it as a Text Encoder for a text-to-image generative AI should indeed be fine (if you download the respective files, and don’t use the transformers library to import the ‘broken’ ‘model.safetensors’ from my HF).

But it does work as a stand-alone model - at least as long as you’re not using the HuggingFace transformers .safetensors. If you dare to load the state_dict.pt version into a ViT-L/14 with “import clip”, you’ll see it has better performance (compared to the OpenAI pre-trained model) on tasks like zero-shot MVT ImageNet/ObjectNet or VOC-2007 multilabel:

edited, can’t post more than one image as n00b

It also has a different attention (more robust on the actual object):

edited, can’t post more than one image as n00b

…And that, in turn, makes it more robust to e.g. adversarial / typographic attack images (the classic, where text is in the image and CLIP ‘obsesses’ about the text and fails to see the rest). This one is a Long-CLIP, but fine-tuned in the same manner (Geometric Parametrization); it is much more confident about this really being an ‘apple’, and not an ‘ipod’. More so, when predicting its own ‘opinion’ (gradient ascent to optimize text embeddings for cosine similarity with a given image embedding), it frequently seems to “see what you did there”, calling it a “productfakespeare”, an “absurd hoax apple”, or a “fakepods apple”. :-))

All three above mentioned images merged into one:

PS: Code for all of that (gradient ascent, fine-tune, and more) is here, if you’re interested: github.com/zer0int.
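In a nutshell, the gradient ascent boils down to something like this (a very rough sketch, not the exact code from the repo; it re-implements encode_text so the token embeddings stay differentiable):

import clip
import torch
from PIL import Image
from clip.simple_tokenizer import SimpleTokenizer

device = "cpu"  # fp32 on CPU keeps the sketch simple
model, preprocess = clip.load("ViT-L/14", device=device)
for p in model.parameters():
    p.requires_grad_(False)

image = preprocess(Image.open("apple_ipod.jpg")).unsqueeze(0)  # placeholder image
with torch.no_grad():
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

def encode_text_from_embeddings(model, tok_emb, tokens):
    # Mirrors CLIP.encode_text, but starts from (differentiable) token embeddings
    x = tok_emb + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    return x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection

tokens = clip.tokenize(["a photo"])
tok_emb = model.token_embedding(tokens).detach().clone().requires_grad_(True)
opt = torch.optim.Adam([tok_emb], lr=0.05)

for step in range(200):
    txt = encode_text_from_embeddings(model, tok_emb, tokens)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    loss = -(img_emb * txt).sum()  # maximize cosine similarity with the image
    opt.zero_grad()
    loss.backward()
    opt.step()

# Snap the optimized embeddings back to the nearest vocabulary tokens ("CLIP's opinion")
nearest = torch.cdist(tok_emb.detach(), model.token_embedding.weight.unsqueeze(0)).argmin(dim=-1)
print(SimpleTokenizer().decode(nearest[0].tolist()))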

I’ll have to look into the rest you’ve provided in a few hours / in the afternoon, when I have more time; thanks in advance! :+1:


But it does work as a stand-alone model. At least as long as you’re not using the HuggingFace transformers .safetensors. If you dare to load the state_dict.pt version into a ViT-L/14 with “import clip”,

Perhaps you have left out this procedure?

import torch
from transformers import CLIPModel, CLIPConfig

sd = torch.load("yourmodel.pth")  # or safetensors.torch.load_file
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")  # it does need a config
model = CLIPModel(config)
model.load_state_dict(sd, strict=False)
model.save_pretrained("your_hfmodel_path")

With this:

import clip
import torch
from transformers import CLIPModel, CLIPConfig
from safetensors.torch import save_file

# Load your fine-tuned OpenAI CLIP model from the torch pickle
openai_model_path = "full-ViT-L-ft.pt"
openai_model, _ = clip.load(openai_model_path, device="cpu")

# Load the state_dict from the OpenAI model
state_dict = openai_model.state_dict()

# Define the Hugging Face CLIPConfig (assuming it's ViT-L/14)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
hf_model = CLIPModel(config)

# Map the OpenAI CLIP state_dict onto the Hugging Face CLIPModel.
# strict=False tolerates missing/unexpected keys instead of raising (it does not remap names).
missing_keys, unexpected_keys = hf_model.load_state_dict(state_dict, strict=False)
print(f"Missing keys: {missing_keys}")
print(f"Unexpected keys: {unexpected_keys}")

save_file(hf_model.state_dict(), "fromraw-converted_hf_model.safetensors")

I now have a bunch of new numbers. Very, very wrong numbers, but they’re different! :rofl:

Image vs 'A photo of a cat': 0.02061634138226509
Image vs 'A picture of a dog': 0.015210384503006935
Image vs 'cat': 0.002116560935974121
Image vs 'dog': 0.027614593505859375
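One thing worth quantifying (a quick sketch, continuing from the script above): how much of the state_dict actually matches the HF CLIPModel key names, since strict=False will silently skip anything that doesn’t:

# continuing from the conversion script above:
hf_keys = set(hf_model.state_dict().keys())
ckpt_keys = set(state_dict.keys())
print("checkpoint keys:", len(ckpt_keys))
print("HF model keys:  ", len(hf_keys))
print("overlap:        ", len(ckpt_keys & hf_keys))
print(sorted(ckpt_keys - hf_keys)[:10])  # sample of keys that strict=False silently drops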

How about this!

#import clip
import torch
import os
from transformers import CLIPModel, CLIPConfig
from safetensors.torch import save_file, load_file
from huggingface_hub import save_torch_model

# Load your fine-tuned CLIP weights (this time from a safetensors file)
openai_model_path = "ViT-L-14-GmP-ft.safetensors"
#openai_model, _ = clip.load(openai_model_path, device="cpu")

# Load the state_dict from the OpenAI model
#state_dict = openai_model.state_dict()
state_dict = load_file(openai_model_path, device="cpu")

# Define the Hugging Face CLIPConfig (assuming it's ViT-L/14)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
hf_model = CLIPModel(config)

# Map the CLIP state_dict onto the Hugging Face CLIPModel
hf_model.load_state_dict(state_dict, strict=False)  # strict=False tolerates missing/unexpected keys

#missing_keys, unexpected_keys = hf_model.load_state_dict(state_dict, strict=False)
#print(f"Missing keys: {missing_keys}")
#print(f"Unexpected keys: {unexpected_keys}")

#save_file(hf_model.state_dict(), "fromraw-converted_hf_model.safetensors")
save_dir = "fromraw-converted_hf_model"
os.makedirs(save_dir, exist_ok=True)
save_torch_model(model=hf_model, save_directory=save_dir)
# https://huggingface.co/docs/huggingface_hub/v0.25.0/package_reference/serialization