Hello,
I have downloaded Vicuna 7B (lmsys/vicuna-7b-v1.5 · Hugging Face) and am using the Hugging Face transformers library to run it locally. Everything runs fine until I try to save the model after quantizing it.
When I load my model locally this way and try to save it:
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoConfig
import torch
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# quantize the weights to int8 via transformers' quanto integration
quantization_config = QuantoConfig(weights="int8")
tokenizer = AutoTokenizer.from_pretrained("vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "vicuna-7b-v1.5",
    torch_dtype=torch.float32,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)
model.save_pretrained("./vicuna-7b-v1.5-quant-8bit")
It raises this error:
ValueError: The model is quantized with QuantizationMethod.QUANTO and is not serializable
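As a possible workaround for this first error, I thought about bypassing save_pretrained and dumping the raw state dict with plain torch.save, roughly like the sketch below (the path is just an example, and I have not verified that quanto's frozen, quantized weights actually survive this round trip):
import os
import torch

# Sketch only: save the quantized model's state dict directly instead of using save_pretrained.
os.makedirs("./vicuna-7b-v1.5-quant-8bit", exist_ok=True)
torch.save(model.state_dict(), "./vicuna-7b-v1.5-quant-8bit/pytorch_model.bin")
tokenizer.save_pretrained("./vicuna-7b-v1.5-quant-8bit")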
When I load the model (locally) and apply 8-bit quantization with quanto along the way:
from transformers import AutoTokenizer, AutoModelForCausalLM
import quanto
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

tokenizer = AutoTokenizer.from_pretrained("vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("vicuna-7b-v1.5", low_cpu_mem_usage=True)

# quantize the weights to int8 in place, then freeze to materialize the quantized weights
quanto.quantize(model, weights=quanto.qint8, activations=None)
quanto.freeze(model)

model.save_pretrained("./vicuna-7b-v1.5-quant-8bit")
I see this error when trying to save it:
ValueError: do_sample is set to False. However, temperature is set to 0.9 – this flag is only used in sample-based generation modes. Set do_sample=True or unset temperature to continue.
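Following the message, I tried adjusting the model's generation config before saving, roughly like this (I may well be doing this part wrong):
# Attempted fix for the do_sample / temperature complaint, as suggested by the error message:
# either enable sampling, or unset the sampling-only parameter.
model.generation_config.do_sample = True
# alternatively: model.generation_config.temperature = None
model.save_pretrained("./vicuna-7b-v1.5-quant-8bit")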
But that does not help, and after a few hours on the HF forums and Stack Overflow I couldn't find a solution either. Some users hint that downgrading the transformers library fixes the second error, but that doesn't work for me.