Converting weights to .safetensors with HF format -> CLIP-L is ruined. Why?

I have fine-tuned openai/clip-vit-large-patch14 → https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/

I finally figured out that you need the metadata in the model file and all that, and it ‘works’ (as in, my model loads):

from transformers import CLIPProcessor, CLIPModel

model_id = "zer0int/CLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

But, comparing:

CLIPModel.from_pretrained("openai/clip-vit-large-patch14")


Cosine similarity (image vs 'A photo of a cat'): 0.2330581396818161
Cosine similarity (image vs 'A picture of a dog'): 0.15255104005336761
Cosine similarity (image vs 'cat'): 0.21000739932060242
Cosine similarity (image vs 'dog'): 0.14514459669589996
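
(For context, these numbers come from a simple comparison script along these lines - a minimal sketch with an illustrative image path, not necessarily my exact code:)

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-large-patch14"  # or "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")  # illustrative test image
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# L2-normalize, then the dot product is the cosine similarity
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
for t, sim in zip(texts, (image_emb @ text_emb.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{t}'): {sim}")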

I have re-converted the original model to a HuggingFace model.safetensors from my original torch.save pickle file (fine-tuned with “import clip”), using the original OpenAI/CLIP as a ‘donor’ for the missing ‘position_ids’ as well as for ‘syntax inspiration’. All keys match. logit_scale matches. Still, when I load my model, I always get something along the lines of:

CLIPModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")

Image vs 'A photo of a cat': 0.05461934581398964
Image vs 'A picture of a dog': 0.030599746853113174
Image vs 'cat': -0.0010263863950967789
Image vs 'dog': 0.004391679540276527

I have no idea what is going on with that messed up cosine similarity. I am obviously doing something wrong.

And, to make sure, I also load the exact same model file I used for the conversion to .safetensors - but this time as the PyTorch pickle .pt file. The exact same model, just not in HuggingFace format. And I get:

Cosine similarity (image vs 'A photo of a cat'): 0.2086181640625
Cosine similarity (image vs 'A picture of a dog'): 0.08636474609375
Cosine similarity (image vs 'cat'): 0.1849365234375
Cosine similarity (image vs 'dog'): 0.0947265625

This is absolutely as expected. Slightly less confident than original CLIP about this being a “cat” - but absolutely SUPER confident that this is NOT a dog.
That re-organization of embeddings is why my model outperforms the original one.
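
(That check uses the original “import clip” package, roughly like this - again a sketch with illustrative file names:)

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# clip.load also accepts a local checkpoint path (state_dict or JIT archive)
model, preprocess = clip.load("full-ViT-L-ft.pt", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["A photo of a cat", "A picture of a dog", "cat", "dog"]).to(device)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(texts)
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
print(img @ txt.T)  # one cosine similarity per prompt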

No idea what I am doing wrong. :hugs:
I just “stole” the original config etc. from “openai/clip-vit-large-patch14”. I did NOT change / re-train the tokenizer. My model is just a ‘normal’ CLIP ViT-L/14, fine-tuned.

I saw there was a “SFconvertbot” that apparently created the .safetensors file for OpenAI’s original model. Do I just have to upload one model separately, as a pickle (.bin?), and your bot will come by and fix this? =)

Any help is appreciated - from bot or human alike! Thank you!

I don’t know if this is related, but if, for example, adjustments were made in ComfyUI, the key names may have been changed along the way, as shown below.
Other than that, there should be no structural difference between Diffusers and the other components, aside from the UNET and VAE…

text_model.encoder.layers.0.layer_norm2.weight => text_encoders.clip_l.transformer.text_model.encoder.layers.0.layer_norm2.weight

Thank you for your response!
Yes, fortunately it seems like just the Text Encoder of CLIP works fine as-is in HuggingFace safetensors format. ComfyUI also handles a state_dict.pt in the original OpenAI “import clip” format (naming) and converts it appropriately, so it can take either .safetensors or any pickle format just fine - and it seems to produce the same results. Forge and some of the gguf / quantization stuff people were exploring apparently need everything to be in the HF format, though - or so I was told (I only use ComfyUI).

So, it seems like it’s a specific issue with trying to make my full model - text and vision - work with the transformers library. I must be doing something wrong during the conversion, and I can’t figure out what it is. :upside_down_face:

AI-related things are advancing so fast that the specifications are not set in stone - everything is effectively a proprietary spec - so this compatibility issue is interesting and worth looking into.
There were no errors when reading or writing these safetensors files, and if the model can go through from_pretrained and save_pretrained, there shouldn’t be any format inconsistencies for the CLIP model either - that path fails quite loudly when something is wrong.
There were also no unwanted extras added when the model was packaged by ComfyUI.
If any parts were converted manually, a bug could have crept in there, but I don’t think that alone would be enough to cause this kind of malfunction…

Now all we can do is play the game of looking for mistakes…:sweat_smile:

FLUX Original CLIP keys

CLIPs from your repo keys

Test Script

import torch
from safetensors.torch import load_file, save_file
from pathlib import Path

def normalize_key(k: str):
    return k.replace("vae.", "").replace("model.diffusion_model.", "")\
        .replace("text_encoders.clip_l.transformer.", "")\
        .replace("text_encoders.t5xxl.transformer.", "")

filename = "model_original.safetensors"
savename = Path(filename).stem + "_fixed" + Path(filename).suffix
oldlogname = Path(filename).stem + "_fixed" + ".old_keys.txt"
newlogname = Path(filename).stem + "_fixed" + ".new_keys.txt"

state_dict = load_file(filename)
new_sd = dict()

keys_old = []
keys_new = []
with torch.no_grad():
    try:
        for k, v in state_dict.items():
            nkey = normalize_key(k)
            print(f"{k} => {nkey}")
            keys_old.append(k)
            keys_new.append(nkey)
            new_sd[nkey] = v  # store under the normalized key so the "_fixed" file actually gets the renamed keys
    except Exception as e:
        print(e)

#save_file(new_sd, savename)

with open(oldlogname, encoding='utf-8', mode='w') as f:
    f.write("\n".join(keys_old))
with open(newlogname, encoding='utf-8', mode='w') as f:
    f.write("\n".join(keys_new))

Speaking of which, do you think this should eventually work as a stand-alone CLIP text model?
Or one that is integrated into SDXL or Flux?

P.S.

If we can load it with CLIPTextModel, we can handle the rest.
But the loading itself already works in the first place… it’s just that the behavior is weird.:nauseated_face:
I looked at the create_diffusers_clip_model_from_ldm function, but it just discards unused parts of the CLIP model and fixes prefixes.

from diffusers.loaders.single_file_utils import create_diffusers_clip_model_from_ldm
from transformers import CLIPTextModel

# CLIP (sd = the single-file state_dict loaded beforehand, e.g. via safetensors.torch.load_file)
clip_ = create_diffusers_clip_model_from_ldm(
    cls=CLIPTextModel, checkpoint=sd, config="black-forest-labs/FLUX.1-dev", subfolder="text_encoder"
)

P.S.

Maybe we have to use CLIPTextModel instead of CLIPModel for compatibility?
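
Something like this (untested sketch - I’m assuming your repo name here) would at least isolate the text tower:

from transformers import CLIPTextModel, CLIPTokenizer

# CLIPTextModel can be loaded straight from a full CLIPModel repo;
# it only picks up the text_config and the text_model.* weights
text_model = CLIPTextModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")
tokenizer = CLIPTokenizer.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")

inputs = tokenizer(["A photo of a cat"], return_tensors="pt", padding=True)
print(text_model(**inputs).pooler_output.shape)  # should be (1, 768) for the ViT-L/14 text encoder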

Using it as a Text Encoder for a text-to-image generative AI should indeed be fine (if you download the respective files, and don’t use the transformers library to import the ‘broken’ ‘model.safetensors’ from my HF).

But it does work as a stand-alone model. At least as long as you’re not using the HuggingFace transformers .safetensors. If you dare to load the state_dict.pt version into a ViT-L/14 with “import clip”, you’ll see it has better performance than the OpenAI pre-trained model on tasks like zero-shot MVT ImageNet/ObjectNet or VOC-2007 multilabel:

edited, can’t post more than one image as n00b

It also has a different attention (more robust on the actual object):

edited, can’t post more than one image as n00b

…And that, in turn, makes it more robust to e.g. adversarial / typographic attack images (the classic case where text is in the image and CLIP ‘obsesses’ about the text and fails to see the rest). This one is a Long-CLIP, but fine-tuned in the same manner (Geometric Parametrization); it is much more confident about this really being an ‘apple’ and not an ‘ipod’. Moreover, when predicting its own ‘opinion’ (gradient ascent to optimize text embeddings for cosine similarity with a given image embedding), it frequently seems to “see what you did there”, calling it a “productfakespeare”, an “absurd hoax apple”, or a “fakepods apple”. :-))

All three above mentioned images merged into one:

PS: Code for all of that (gradient ascent, fine-tune, and more) is here, if you’re interested: github.com/zer0int.

I’ll have to look into the rest you’ve provided in a few hours / in the afternoon, when I have more time; thanks in advance! :+1:


But it does work as a stand-alone model. At least as long as you’re not using the HuggingFace transformers .safetensors. If you dare to load the state_dict.pt version into a ViT-L/14 with “import clip”,

Perhaps you have left out this procedure?

import torch
from transformers import CLIPModel, CLIPConfig
sd = torch.load("yourmodel.pth", map_location="cpu")  # or safetensors.torch.load_file
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")  # it does need a config
model = CLIPModel(config)
model.load_state_dict(sd, strict=False)  # load_state_dict (from_state_dict doesn't exist)
model.save_pretrained("your_hfmodel_path")

With this:

import clip
import torch
from transformers import CLIPModel, CLIPConfig
from safetensors.torch import save_file

# Load your fine-tuned OpenAI CLIP model from the torch pickle
openai_model_path = "full-ViT-L-ft.pt"
openai_model, _ = clip.load(openai_model_path, device="cpu")

# Load the state_dict from the OpenAI model
state_dict = openai_model.state_dict()

# Define the Hugging Face CLIPConfig (assuming it's ViT-L/14)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
hf_model = CLIPModel(config)

# Map the OpenAI CLIP state_dict onto the Hugging Face CLIPModel
# strict=False tolerates missing / unexpected keys instead of raising
missing_keys, unexpected_keys = hf_model.load_state_dict(state_dict, strict=False)
print(f"Missing keys: {missing_keys}")
print(f"Unexpected keys: {unexpected_keys}")

save_file(hf_model.state_dict(), "fromraw-converted_hf_model.safetensors")

I now have a bunch of new numbers. Very, very wrong numbers, but they’re different! :rofl:

Image vs 'A photo of a cat': 0.02061634138226509
Image vs 'A picture of a dog': 0.015210384503006935
Image vs 'cat': 0.002116560935974121
Image vs 'dog': 0.027614593505859375

How about this!

#import clip
import torch
import os
from transformers import CLIPModel, CLIPConfig
from safetensors.torch import save_file, load_file
from huggingface_hub import save_torch_model

# Load your fine-tuned CLIP model - this time from the .safetensors file instead of the torch pickle
openai_model_path = "ViT-L-14-GmP-ft.safetensors"
#openai_model, _ = clip.load(openai_model_path, device="cpu")

# Load the state_dict from the OpenAI model
#state_dict = openai_model.state_dict()
state_dict = load_file(openai_model_path, device="cpu")

# Define the Hugging Face CLIPConfig (assuming it's ViT-L/14)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
hf_model = CLIPModel(config)

# Map the OpenAI CLIP state_dict to Hugging Face CLIPModel state_dict
hf_model.load_state_dict(state_dict, strict=False)  # strict=False allows for differences in metadata

#missing_keys, unexpected_keys = hf_model.load_state_dict(state_dict, strict=False)
#print(f"Missing keys: {missing_keys}")
#print(f"Unexpected keys: {unexpected_keys}")

#save_file(hf_model.state_dict(), "fromraw-converted_hf_model.safetensors")
save_dir = "fromraw-converted_hf_model"
os.makedirs(save_dir, exist_ok=True)  # don't fail if the folder already exists
save_torch_model(model=hf_model, save_directory=save_dir)
# https://huggingface.co/docs/huggingface_hub/v0.25.0/package_reference/serialization
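
It might also be worth counting how many checkpoint keys actually line up with what the HF model expects (rough sketch, reusing the variables from the script above):

# Diagnostic: how many keys in the loaded state_dict match what CLIPModel expects?
# Anything that doesn't match is silently skipped by load_state_dict(strict=False),
# and the corresponding HF weights stay at their default init values.
hf_keys = set(hf_model.state_dict().keys())
ckpt_keys = set(state_dict.keys())
print(f"Matching keys: {len(hf_keys & ckpt_keys)} / {len(hf_keys)}")
print(f"Checkpoint-only keys (sample): {sorted(ckpt_keys - hf_keys)[:5]}")
print(f"Model-only keys (sample): {sorted(hf_keys - ckpt_keys)[:5]}")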

Nope. Still the same erratic numbers; I tried from various starting points:

  • Convert without changes (load .pt, save as .safetensors, use your script - no prior conversion to HF naming scheme)
  • Try previously converted-to-huggingface .safetensors → use your script.

Same difference: erratic cosine similarities of roughly 0 (only non-zero in the third or fourth decimal place, as before).

I thought about it. Actually, there is one difference in OpenAI’s original ViT-L-14.pt vs. my fine-tune… Mine is a simple torch.save, while OpenAI’s is a TorchScript / jit. Maybe that makes a difference…?
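
(A quick way to check that difference - with illustrative file names - since torch.jit.load only accepts a TorchScript archive and raises on a plain torch.save pickle:)

import torch

def is_torchscript(path: str) -> bool:
    try:
        torch.jit.load(path, map_location="cpu")
        return True
    except Exception:  # a plain torch.save pickle fails to load as a TorchScript archive
        return False

print(is_torchscript("ViT-L-14.pt"))       # OpenAI's original: expected True
print(is_torchscript("full-ViT-L-ft.pt"))  # my torch.save fine-tune: expected False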

PS: Thank you very much for your perseverance, I really appreciate it! :+1:


Well, conversion is something of my hobby.:hugs:

there is one difference in OpenAI’s original ViT-L-14.pt vs. my fine-tune… Mine is a simple torch.save, while OpenAI’s is a TorchScript / jit. Maybe that makes a difference…?

That’s what I thought when I was writing the code, too. I think it’s very possible. I got the impression that the OpenAI CLIP library does a lot of things with pickle.

I think there were several modified versions of CLIP for ComfyUI on HF, so it might be useful to check whether they work, to help isolate the problem.
I mean, your CLIP also works everywhere except after the conversion to HF format…
So maybe they are all doomed to fail under the current HF specifications in the first place…? That kind of thing is surprisingly common.

A rather large number of people use HF as simple storage, so they don’t really care whether their models work with the hub or not.
Stable storage is essential for exchanging information during AI development, though, so if HF is fine with that, I consider that tolerance a good thing.
But, well, it’s more fun if it actually works!

Hmm, that’s a good idea! And if another version works, I can write to the author of the model and ask them how they pulled it off.

Or upload a new HF repo in which I upload everything like OpenAI did, with a pytorch_model.bin - and then hope I also get an SFconvertbot commit (“Adding safetensors variant of this model (#19)”, 32bd642) :wink:

Meanwhile, I have been looking into the TorchScript question with GPT-4o and o1-preview.

It’s complicated by them not liking ‘attacks’ on the model. And special tokens do count as attacks. Unfortunately, both GPT and CLIP have the same ones…

First, I got a warning for copypasting the CLIP code, including <|startoftext|> and <|endoftext|>.

Eventually, the AI realized we need to include the tokenizer, since plain “import clip” is incompatible with TorchScript, so it predicted a multi-token version of its own special tokens… Aaaand that also ‘triggered’ OpenAI. :rofl:

Welcome to MadHouse. Oh well. Happy Friday! :upside_down_face:

It’s Saturday over here.:hugs:

I can write to the author of the model and ask them how they pulled it off.

A new Discussion on the model repo can be used as a de facto BBS, and you can send a mention (@ followed by the username) to the author.

First, I got a warning for copypasting the CLIP code, including <|startoftext|> and <|endoftext|>.

Oh, so they think it’s a jailbreak or prompt injection.


I decided that it would be quicker if I actually got it working, so I took the HF staff’s source and made a demo.
It seems to be working rather well…?
At least it’s not as bad as cats and dogs coming out reversed.


That was actually super helpful, what a great idea to set up a space - thank you!
I have no idea how they are loading it in ‘spaces’ exactly to cause this difference, but - it seems like something is off with the logit_scale.

Logit scale is ‘2.6592’ for original CLIP, and my model has ‘4.6052’ - internally, in the model’s parameter, as it’s a learnable parameter in CLIP.

I also changed it in my config.json on HF now, but it doesn’t seem to make a difference when Transformers library is handling it.
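
(If I understand correctly, that config field is only an initialization value anyway; the parameter that actually gets used comes from the weights - a quick check, repo name as above:)

from transformers import CLIPModel

model = CLIPModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")
print(model.config.logit_scale_init_value)  # from config.json - only used when a model is freshly initialized
print(model.logit_scale)                    # the learned parameter actually loaded from the weights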

In fact, I just manipulated logit_scale after loading to truly ridiculous values, and it did nothing:

Parameter containing:
tensor(4.6052, requires_grad=True) # as loaded
Parameter containing:
tensor([100.], requires_grad=True) # after manipulating
Cosine similarity (image vs 'A photo of a cat'): 0.054619330912828445
Cosine similarity (image vs 'A picture of a dog'): 0.030599746853113174
Cosine similarity (image vs 'cat'): -0.0010263854637742043
Cosine similarity (image vs 'dog'): 0.004391679540276527


Parameter containing:
tensor(4.6052, requires_grad=True)
Parameter containing:
tensor([1.0000e-04], requires_grad=True)
Cosine similarity (image vs 'A photo of a cat'): 0.054619330912828445
Cosine similarity (image vs 'A picture of a dog'): 0.030599746853113174
Cosine similarity (image vs 'cat'): -0.0010263854637742043
Cosine similarity (image vs 'dog'): 0.004391679540276527

Maybe the Transformers library is just internally resetting logit_scale to OpenAI’s default, or something? It certainly seems to be ignored entirely - no matter whether I change it in config.json or in the model’s parameter.
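
For what it’s worth, here is a rough sketch of how logit_scale enters the computation, as I understand the transformers CLIP code: the embeddings are L2-normalized first and logit_scale only multiplies the resulting logits, so the raw cosine similarities can’t react to it.

import torch

# Shapes as in ViT-L/14 (projection dim 768); values are random, just to show the mechanics
image_embeds = torch.randn(1, 768)
text_embeds = torch.randn(4, 768)
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)

cosine = image_embeds @ text_embeds.t()    # what my script prints - logit_scale never touches this
logit_scale = torch.tensor(4.6052).exp()   # the parameter is stored as a log and exponentiated
logits_per_image = logit_scale * cosine    # logit_scale only scales the logits here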


Hi,

The original CLIP format can be converted to a HF model using the conversion script: transformers/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py at main · huggingface/transformers · GitHub.


The original CLIP format can be converted to a HF model using the conversion script:

Oops, that was a blind spot.:sweat_smile:
The CLIPTextModel used by Diffusers is almost in the original format (probably for Stable Diffusion’s sake), but the CLIPModel in Transformers is not…
Anyway, this solves the problem!


Oh - you’ve actually fixed that code! D’oh. I should’ve checked that for changes. I still had the old version that would throw an assertion error, so you’d have to downgrade transformers - and the output was a .bin file. So I dismissed that code as ‘doesn’t work’. :see_no_evil:

Thank you so much!

And indeed:

And local .safetensors, same:

Cosine Similarities with Bad Model
tensor([[ 0.0546,  0.0306, -0.0010,  0.0044]], device='cuda:0')

Cosine Similarities with Updated Model
tensor([[0.2085, 0.0863, 0.1850, 0.0947]], device='cuda:0')

Looking at what happened:

Difference found in key: vision_model.encoder.layers.22.layer_norm2.bias
Bad Model tensor: tensor([0., 0., 0.,  ..., 0., 0., 0.])
Updated Model tensor: tensor([ 0.1840,  0.3325, -0.2294,  ..., -0.1087,  0.5931, -0.2810])

Difference found in key: text_model.encoder.layers.5.layer_norm1.weight
Bad Model tensor: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
Updated Model tensor: tensor([1.4388, 1.5490, 1.4443, 1.4368, 1.4696, 1.5982, 1.4797, 1.4962, 1.5059,
        1.4201, 1.5555, 1.4057, 1.4634, 1.5500, 1.3880, 1.4897, 1.4536, 1.6242,
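
(The comparison itself is just a key-by-key tensor diff, roughly like this sketch with illustrative file names:)

import torch
from safetensors.torch import load_file

bad = load_file("model.bad.safetensors")     # my earlier, broken conversion
good = load_file("model.fixed.safetensors")  # output of the official conversion script

for k in good:
    if k in bad and not torch.equal(bad[k], good[k]):
        print(f"Difference found in key: {k}")
        print("Bad Model tensor:", bad[k].flatten()[:8])
        print("Updated Model tensor:", good[k].flatten()[:8])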

Basically, my botched conversion reset all the LayerNorm weights and biases to their initialization values (1s and 0s), turning the affine part of every Layer Normalization into a no-op. :sweat_smile:

Thanks to you both for the help, @nielsr and @John6666 :hugs:


Yay.
There are many useful resources in the GitHub repos of transformers and diffusers, so I need to get into the habit of looking there first… this was a good reminder!

