Hi, I’m fine-tuning an LLM on my own data using SFTTrainer, bitsandbytes quantization, and PEFT with the configs listed below. When I convert the model to GGUF for CPU inference, its performance drops significantly. Any idea what the problem could be?
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the base model, fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
)
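For completeness, the trainer is wired up roughly as in the sketch below; the base model path, dataset file, and training arguments are placeholders rather than my exact values, and the SFTTrainer keyword names may differ slightly depending on the trl version:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Placeholders: substitute your own base model and training data.
base_model = AutoModelForCausalLM.from_pretrained(
    "<BASE_MODEL_PATH>",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("<BASE_MODEL_PATH>")
dataset = load_dataset("json", data_files="<TRAIN_DATA.jsonl>", split="train")

trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,          # LoRA config from above
    tokenizer=tokenizer,
    dataset_text_field="text",        # field name depends on the dataset
    args=TrainingArguments(output_dir="<ADAPTER_MODEL_PATH>"),
)
trainer.train()
trainer.save_model("<ADAPTER_MODEL_PATH>")  # saves only the LoRA adapter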
I do the conversion to GGUF as follows. First, I merge the trained adapter with the base model (a sketch of the merge step is below). The merged model is then converted to GGUF with llama.cpp's convert.py script; I use q8_0 quantization and have tested other quantization types without success. I also tried the conversion with Unsloth, likewise without a positive result.
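The merge step itself looks roughly like this (a minimal sketch; the paths are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in fp16 (not 4-bit) so the merge produces
# dense half-precision weights that convert.py can read.
base = AutoModelForCausalLM.from_pretrained(
    "<BASE_MODEL_PATH>",
    torch_dtype=torch.float16,
)
merged = PeftModel.from_pretrained(base, "<ADAPTER_MODEL_PATH>")
merged = merged.merge_and_unload()  # fold the LoRA deltas into the base weights

merged.save_pretrained("<MERGED_MODEL_PATH>")
AutoTokenizer.from_pretrained("<BASE_MODEL_PATH>").save_pretrained("<MERGED_MODEL_PATH>")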
python convert.py <MERGED_MODEL_PATH> \
    --outfile <OUTPUT_MODEL_NAME.gguf> \
    --outtype q8_0 \
    --vocab-dir <ADAPTER_MODEL_PATH>
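For reference, I then run the GGUF on CPU roughly like this (the prompt is a placeholder), which is where I see the quality drop:

./main -m <OUTPUT_MODEL_NAME.gguf> -p "<PROMPT>"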