Load quantized model in memory

Hey guys,

I’m looking for some guidance on a particular issue.
I’m not using HF’s Trainer, but my own PyTorch training loop. The workflow looks like this:

from copy import deepcopy

best_score = 0
for epoch in epochs:
    # Training
    # ...
    if metric > best_score:
        best_score = metric  # keep track of the best score so far
        best_model = deepcopy(model.state_dict())  # note: state_dict() must be called

And then, to evaluate the best model:

model.load_state_dict(best_model)

This approach works perfectly when the model is created with AutoModelForSequenceClassification.from_pretrained(modelcp, num_labels=labels). However, when I create it with AutoModelForSequenceClassification.from_pretrained(modelcp, num_labels=labels, load_in_8bit=True), I encounter the following error:

RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()
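
For what it’s worth, the message itself suggests moving the module onto the GPU before restoring the weights. Here is a minimal, untested sketch of what I think it is asking for (single-GPU setup assumed; modelcp, labels, and best_model are the same variables as in my snippets above):

from transformers import AutoModelForSequenceClassification

# Untested sketch: re-create the quantized model, move it to CUDA first,
# and only then restore the best weights, as the error message suggests.
model = AutoModelForSequenceClassification.from_pretrained(
    modelcp, num_labels=labels, load_in_8bit=True
)
model.cuda()  # the step the error message asks for
model.load_state_dict(best_model)

I haven’t been able to confirm that this works, which is partly why I’m asking.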

Does anyone have any insights on how to resolve this (or whether the sketch above is on the right track)? Cheers!

I have the same issue. Could anyone help?