valueError: Supplied state dict for layers does not contain `bitsandbytes__*` and possibly other `quantized_stats`(when load saved quantized model)

wichofer · October 9, 2024, 2:54am

We are trying to deploy a quantized Llama 3.1 70B model(from Huggingface, using bitsandbytes), quantizing part works fine as we check the model memory which is correct and also test getting predictions for the model, which is also correct, the problem is: after saving the quantized model and then loading it we get

valueError: Supplied state dict for layers.0.mlp.down_proj.weight does not contain bitsandbytes__* and possibly other quantized_stats components

What we do is:

Save the quantized model using the usual save_pretrained(save_dir)
Try to load the model using AutoModel.from_pretrained, passing the save_dir and the same quantization_config used when creating the model.

Here is the code:

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"


cache_dir = "/home/ec2-user/SageMaker/huggingface_cache"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    offload_folder="offload",
    offload_state_dict=True,
    cache_dir=cache_dir
)

tokenizer = AutoTokenizer.from_pretrained(model_id,cache_dir=cache_dir)

pt_save_directory = "test_directory"
tokenizer.save_pretrained(pt_save_directory,)
model_4bit.save_pretrained(pt_save_directory)
## test load it

loaded_model = AutoModel.from_pretrained(pt_save_directory,
                                     quantization_config=quantization_config
                                     )

John6666 · October 10, 2024, 10:01am

typo.

tokenizer.save_pretrained(pt_save_directory,)

to

tokenizer.save_pretrained(pt_save_directory)

schnell18 · October 22, 2024, 12:10pm

This is a bug in the _load_pretrained_model() function of
transformers/modeling_utils.py when loading sharded weight files. The
state_dict is applied to the empty model per shard. This is problematic as the
quantized weight and its meta data(*.quant_state.bitsandbytes__nf4) may be
stored in the different shards. The quick-and-dirty fix is to merge tensors
from all shards into one state_dict. Similar issues have been reported on stackoverflow
and the unsloth github issue 638

ZainZia · May 30, 2025, 3:52pm

Is this issue fixed ?

I am getting the same error

in

import torch
from transformers import (
pipeline,
BitsAndBytesConfig,
)

1) Build your 4-bit config.

bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
llm_int8_enable_fp32_cpu_offload=True, # keeps some weights in FP32 on CPU
bnb_4bit_quant_type=“nf4”, # or “fp4”, “fp4-dq”, etc.
bnb_4bit_compute_dtype=torch.float16, # compute in fp16 on GPU
)

2) Create the pipeline, passing quantization_config:

pipe = pipeline(
“image-text-to-text”,
model=“unsloth/gemma-3-27b-it-unsloth-bnb-4bit”,
cache_dir=“/mnt/models/gemma3”,
trust_remote_code=True,
device_map=“auto”,
quantization_config=bnb_config, # ← here’s the key
)

messages = [
{
“role”: “user”,
“content”: [
{“type”: “image”, “url”: “https://…/candy.JPG”},
{“type”: “text”, “text”: “What animal is on the candy?”}
]
},
]

print(pipe(text=messages))

John6666 · May 30, 2025, 11:37pm

Is this issue fixed ?

Maybe not yet.

Topic		Replies	Views
Error loading tokenizer: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3 🤗Tokenizers	3	3592	October 10, 2024
Error when quantization codellama 70b Models	3	118	June 20, 2024
Issue in deploying quantized meta-llama/Llama-3.1-8B-Instruct in aws sagemaker Intermediate	0	72	October 10, 2024
An error i ve been trying to fix for days now Intermediate	4	434	November 19, 2024
Error loading Llama model Beginners	5	1566	March 9, 2024

valueError: Supplied state dict for layers does not contain `bitsandbytes__*` and possibly other `quantized_stats`(when load saved quantized model)

1) Build your 4-bit config.

2) Create the pipeline, passing quantization_config:

Related topics