Converting weights to .safetensors with HF format -> CLIP-L is ruined. Why?

I have fine-tuned openai/clip-vit-large-patch14 → https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/

I finally figured out you need metadata in the model and all, and it ‘works’ (as in, my model loads):

from transformers import CLIPProcessor, CLIPModel

model_id = "zer0int/CLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

But, comparing:

CLIPModel.from_pretrained("openai/clip-vit-large-patch14")


Cosine similarity (image vs 'A photo of a cat'): 0.2330581396818161
Cosine similarity (image vs 'A picture of a dog'): 0.15255104005336761
Cosine similarity (image vs 'cat'): 0.21000739932060242
Cosine similarity (image vs 'dog'): 0.14514459669589996
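For reference, those numbers come from something like this quick check - a sketch; the test image is a placeholder:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")  # placeholder test image
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# L2-normalize the projected embeddings, then cosine similarity is just a dot product
img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
for t, sim in zip(texts, (img @ txt.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{t}'): {sim}")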

I have re-converted the original model to a HuggingFace model.safetensors from my original torch.save pickle file (fine-tuned with “import clip”), using the original OpenAI/CLIP as a ‘donor’ for the missing ‘position_ids’ as well as for ‘syntax inspiration’ (see the sketch below). All keys match. logit_scale matches. Still, when I load my model, I always get something along the lines of:

CLIPModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")

Image vs 'A photo of a cat': 0.05461934581398964
Image vs 'A picture of a dog': 0.030599746853113174
Image vs 'cat': -0.0010263863950967789
Image vs 'dog': 0.004391679540276527

I have no idea what is going on with that messed up cosine similarity. I am obviously doing something wrong.
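For reference, the ‘donor’ step for the missing position_ids was roughly this (a sketch; file names are placeholders):

import torch
from transformers import CLIPModel

# Reference HF model used as 'donor' for buffers missing from my converted state_dict
donor = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").state_dict()
converted = torch.load("converted_hf_state_dict.pt")  # placeholder: my converted weights

for k in ("text_model.embeddings.position_ids", "vision_model.embeddings.position_ids"):
    if k not in converted and k in donor:
        converted[k] = donor[k].clone()

# Sanity checks: same key set, same logit_scale
print(set(converted.keys()) == set(donor.keys()))
print(converted["logit_scale"], donor["logit_scale"])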

And, to make sure, I am loading the exact same model file I used for the conversion to .safetensors - but this time as the PyTorch pickle .pt file. So it’s the exact same model, just not in HuggingFace format. And I get:

Cosine similarity (image vs 'A photo of a cat'): 0.2086181640625
Cosine similarity (image vs 'A picture of a dog'): 0.08636474609375
Cosine similarity (image vs 'cat'): 0.1849365234375
Cosine similarity (image vs 'dog'): 0.0947265625

This is absolutely as expected. Slightly less confident than original CLIP about this being a “cat” - but absolutely SUPER confident that this is NOT a dog.
That re-organization of embeddings is why my model outperforms the original one.
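That check, for completeness, is just plain “import clip” loading of the pickle (sketch; the test image is a placeholder):

import clip
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("full-ViT-L-ft.pt", device=device)  # the fine-tuned pickle

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
texts = ["A photo of a cat", "A picture of a dog", "cat", "dog"]
tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)

for t, sim in zip(texts, (img @ txt.T).squeeze(0).tolist()):
    print(f"Cosine similarity (image vs '{t}'): {sim}")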

No idea what I am doing wrong. :hugs:
I just “stole” the original config etc. from “openai/clip-vit-large-patch14”. I did NOT change / re-train the tokenizer. My model is just a ‘normal’ CLIP ViT-L/14, fine-tuned.

I saw there was a “SFconvertbot” that apparently created the .safetensors file for OpenAI’s original model. Do I just have to upload one model separately, as a pickle (.bin?), and your bot will come by and fix this? =)

Any help is appreciated - from bot or human alike! Thank you!

I don’t know if this is related, but if, for example, adjustments were made in ComfyUI, the key names may have been changed along the way, as shown below.
Other than that, there should be no structural difference between Diffusers and the other components, aside from the UNET and VAE…

text_model.encoder.layers.0.layer_norm2.weight => text_encoders.clip_l.transformer.text_model.encoder.layers.0.layer_norm2.weight

Thank you for your response!
Yes, fortunately it seems like just the Text Encoder of CLIP works fine as-is in HuggingFace safetensors format. ComfyUI also handles a state_dict.pt in the original OpenAI “import clip” format (naming) and converts it appropriately, so it can take either .safetensors or any pickle format just fine - and it seems to produce the same results. Forge and some of the gguf / quantization things people were exploring apparently need everything to be in the HF format, I was told (I only use ComfyUI myself).

So, it seems like it’s a specific issue with trying to make my full model - text and vision - work with the transformers library. I must be doing something wrong during the conversion, and I can’t figure out what it is. :upside_down_face:

AI-related things are advancing so fast that the specifications are not set in stone; everything is effectively a proprietary format, so this compatibility issue is worth looking into.
There were no errors reading or writing these safetensors files, and since the model can be loaded with from_pretrained and saved with save_pretrained, there shouldn’t be any formatting inconsistencies in the CLIP model - that process fails quite loudly when something is off.
There were also no unwanted extras added when the model was packaged in ComfyUI.
If any parts were converted manually, it’s possible a bug crept in there, but I don’t think that alone would be enough to cause a malfunction like this…

Now all we can do is to play the game of looking for mistakes…:sweat_smile:

FLUX Original CLIP keys

CLIPs from your repo keys

Test Script

import torch
from safetensors.torch import load_file, save_file
from pathlib import Path

def normalize_key(k: str):
    return k.replace("vae.", "").replace("model.diffusion_model.", "")\
        .replace("text_encoders.clip_l.transformer.", "")\
        .replace("text_encoders.t5xxl.transformer.", "")

filename = "model_original.safetensors"
savename = Path(filename).stem + "_fixed" + Path(filename).suffix
oldlogname = Path(filename).stem + "_fixed" + ".old_keys.txt"
newlogname = Path(filename).stem + "_fixed" + ".new_keys.txt"

state_dict = load_file(filename)
new_sd = dict()

keys_old = []
keys_new = []
with torch.no_grad():
    try:
        for k, v in state_dict.items():
            nkey = normalize_key(k)
            print(f"{k} => {nkey}")
            keys_old.append(k)
            keys_new.append(nkey)
            new_sd[nkey] = v  # store under the normalized key
    except Exception as e:
        print(e)

#save_file(new_sd, savename)

with open(oldlogname, encoding='utf-8', mode='w') as f:
    f.write("\n".join(keys_old))
with open(newlogname, encoding='utf-8', mode='w') as f:
    f.write("\n".join(keys_new))

Speaking of which, do you think this should eventually work as a stand-alone CLIP text model?
Or one that is integrated into SDXL or Flux?

P.S.

If we can load it with CLIPTextModel, we can handle the rest.
But the loading itself already works in the first place… it’s just that the behavior is weird. :nauseated_face:
I looked at the create_diffusers_clip_model_from_ldm function, but it just discards unused parts of the CLIP model and fixes prefixes.

from diffusers.loaders.single_file_utils import create_diffusers_clip_model_from_ldm
from transformers import CLIPTextModel

# CLIP (sd = the checkpoint state_dict loaded beforehand, e.g. via safetensors load_file)
clip_ = create_diffusers_clip_model_from_ldm(
    cls=CLIPTextModel, checkpoint=sd, config="black-forest-labs/FLUX.1-dev", subfolder="text_encoder"
)

P.S.

Maybe we have to use CLIPTextModel instead of CLIPModel for compatibility?
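Something along these lines, perhaps (just a sketch; the file name is a placeholder, and it assumes the keys already use the HF text_model.* naming):

import torch
from transformers import CLIPTextModel, CLIPTextConfig

# Build an empty text-only CLIP and load only the text_model.* weights into it
config = CLIPTextConfig.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel(config)

full_sd = torch.load("yourmodel.pth")  # placeholder: full CLIP state_dict
text_sd = {k: v for k, v in full_sd.items() if k.startswith("text_model.")}

missing, unexpected = text_model.load_state_dict(text_sd, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
text_model.save_pretrained("clip_l_text_encoder")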

Using it as a Text Encoder for a text-to-image generative AI should indeed be fine (if you download the respective files, and don’t use the transformers library to import the ‘broken’ ‘model.safetensors’ from my HF).

But it does work as a stand-alone model - at least as long as you’re not using the HuggingFace transformers .safetensors. If you dare to load the state_dict.pt version into a ViT-L/14 with “import clip”, you’ll see it has better performance (compared to the OpenAI pre-trained model) on tasks like zero-shot MVT ImageNet/ObjectNet or VOC-2007 multilabel:

edited, can’t post more than one image as n00b

It also has a different attention (more robust on the actual object):

edited, can’t post more than one image as n00b

…And that, in turn, makes it more robust to e.g. adversarial / typographic attack images (the classic, where text is in the image and CLIP ‘obsesses’ about the text and fails to see the rest). This one is a Long-CLIP, but fine-tuned in the same manner (Geometric Parametrization); it is much more confident about this really being an ‘apple’, and not an ‘ipod’. More so, when predicting its own ‘opinion’ (gradient ascent to optimize text embeddings for cosine similarity with a given image embedding), it frequently seems to “see what you did there”, calling it a “productfakespeare”, an “absurd hoax apple”, or a “fakepods apple”. :-))

All three above mentioned images merged into one:

PS: Code for all of that (gradient ascent, fine-tune, and more) is here, if you’re interested: github.com/zer0int.
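In a nutshell, the gradient ascent boils down to something like this (a very rough sketch, not the exact code from the repo; it re-implements encode_text so the token embeddings stay differentiable):

import clip
import torch
from PIL import Image
from clip.simple_tokenizer import SimpleTokenizer

device = "cpu"  # fp32 on CPU keeps the sketch simple
model, preprocess = clip.load("ViT-L/14", device=device)
for p in model.parameters():
    p.requires_grad_(False)

image = preprocess(Image.open("apple_ipod.jpg")).unsqueeze(0)  # placeholder image
with torch.no_grad():
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

def encode_text_from_embeddings(model, tok_emb, tokens):
    # Mirrors CLIP.encode_text, but starts from (differentiable) token embeddings
    x = tok_emb + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    return x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection

tokens = clip.tokenize(["a photo"])
tok_emb = model.token_embedding(tokens).detach().clone().requires_grad_(True)
opt = torch.optim.Adam([tok_emb], lr=0.05)

for step in range(200):
    txt = encode_text_from_embeddings(model, tok_emb, tokens)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    loss = -(img_emb * txt).sum()  # maximize cosine similarity with the image
    opt.zero_grad()
    loss.backward()
    opt.step()

# Snap the optimized embeddings back to the nearest vocabulary tokens ("CLIP's opinion")
nearest = torch.cdist(tok_emb.detach(), model.token_embedding.weight.unsqueeze(0)).argmin(dim=-1)
print(SimpleTokenizer().decode(nearest[0].tolist()))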

I’ll have to look into the rest you’ve provided in a few hours / in the afternoon, when I have more time; thanks in advance! :+1:


But it does work as a stand-alone model. At least as long as you’re not using the HuggingFace transformers .safetensors. If you dare to load the state_dict.pt version into a ViT-L/14 with “import clip”,

Perhaps you have left out this procedure?

import torch
from transformers import CLIPModel, CLIPConfig

sd = torch.load("yourmodel.pth")  # or safetensors.torch.load_file
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")  # it does need a config
model = CLIPModel(config)
model.load_state_dict(sd, strict=False)
model.save_pretrained("your_hfmodel_path")

With this:

import clip
import torch
from transformers import CLIPModel, CLIPConfig
from safetensors.torch import save_file

# Load your fine-tuned OpenAI CLIP model from the torch pickle
openai_model_path = "full-ViT-L-ft.pt"
openai_model, _ = clip.load(openai_model_path, device="cpu")

# Load the state_dict from the OpenAI model
state_dict = openai_model.state_dict()

# Define the Hugging Face CLIPConfig (assuming it's ViT-L/14)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
hf_model = CLIPModel(config)

# Map the OpenAI CLIP state_dict onto the Hugging Face CLIPModel.
# strict=False tolerates missing/unexpected keys instead of raising (it does not remap names).
missing_keys, unexpected_keys = hf_model.load_state_dict(state_dict, strict=False)
print(f"Missing keys: {missing_keys}")
print(f"Unexpected keys: {unexpected_keys}")

save_file(hf_model.state_dict(), "fromraw-converted_hf_model.safetensors")

I now have a bunch of new numbers. Very, very wrong numbers, but they’re different! :rofl:

Image vs 'A photo of a cat': 0.02061634138226509
Image vs 'A picture of a dog': 0.015210384503006935
Image vs 'cat': 0.002116560935974121
Image vs 'dog': 0.027614593505859375
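One thing worth quantifying (a quick sketch, continuing from the script above): how much of the state_dict actually matches the HF CLIPModel key names, since strict=False will silently skip anything that doesn’t:

# continuing from the conversion script above:
hf_keys = set(hf_model.state_dict().keys())
ckpt_keys = set(state_dict.keys())
print("checkpoint keys:", len(ckpt_keys))
print("HF model keys:  ", len(hf_keys))
print("overlap:        ", len(ckpt_keys & hf_keys))
print(sorted(ckpt_keys - hf_keys)[:10])  # sample of keys that strict=False silently drops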

How about this!

#import clip
import torch
import os
from transformers import CLIPModel, CLIPConfig
from safetensors.torch import save_file, load_file
from huggingface_hub import save_torch_model

# Load your fine-tuned CLIP weights (this time from a safetensors file)
openai_model_path = "ViT-L-14-GmP-ft.safetensors"
#openai_model, _ = clip.load(openai_model_path, device="cpu")

# Load the state_dict from the OpenAI model
#state_dict = openai_model.state_dict()
state_dict = load_file(openai_model_path, device="cpu")

# Define the Hugging Face CLIPConfig (assuming it's ViT-L/14)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
hf_model = CLIPModel(config)

# Map the CLIP state_dict onto the Hugging Face CLIPModel
hf_model.load_state_dict(state_dict, strict=False)  # strict=False tolerates missing/unexpected keys

#missing_keys, unexpected_keys = hf_model.load_state_dict(state_dict, strict=False)
#print(f"Missing keys: {missing_keys}")
#print(f"Unexpected keys: {unexpected_keys}")

#save_file(hf_model.state_dict(), "fromraw-converted_hf_model.safetensors")
save_dir = "fromraw-converted_hf_model"
os.makedirs(save_dir, exist_ok=True)
save_torch_model(model=hf_model, save_directory=save_dir)
# https://huggingface.co/docs/huggingface_hub/v0.25.0/package_reference/serialization