Fine-tuned LLM conversion to GGUF - performance drop

Hi, I’m fine-tuning an LLM on my data using SFTTrainer, bitsandbytes quantization, and PEFT with the configs listed below. When I convert the model to GGUF for CPU inference, its performance drops significantly. Any idea what the problem could be?

    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
    )
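
For completeness, the configs above go into the trainer roughly like this; the model ID, dataset file, and text column below are placeholders rather than my real setup, and some argument names may differ slightly between trl versions:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    base_model_id = "BASE_MODEL_ID"  # placeholder

    # QLoRA-style setup: 4-bit quantized base model + trainable LoRA adapter
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)

    dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        tokenizer=tokenizer,
        dataset_text_field="text",   # placeholder column name
        max_seq_length=1024,
    )
    trainer.train()
    trainer.model.save_pretrained("adapter")  # saves only the LoRA adapter weights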

I convert to GGUF in the following way. First, I merge the trained adapter with the base model. Then the merged model is converted to GGUF using llama.cpp’s ‘convert.py’ script with q8_0 quantization; I tested other quantization types without success. I also tried converting with Unsloth, likewise without a positive result.

python convert.py <MERGED_MODEL_PATH> \
  --outfile <OUTPUT_MODEL_NAME.gguf> \
  --outtype q8_0 \
  --vocab_dir <ADAPTER_MODEL_PATH>
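
For reference, the merge step itself looks roughly like this; it uses the same placeholder paths as in the command above, plus <BASE_MODEL_PATH> for the base model, and is a sketch rather than my exact script:

    # Merge the LoRA adapter into the base model before conversion (placeholder paths)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "<BASE_MODEL_PATH>",
        torch_dtype=torch.float16,
    )
    model = PeftModel.from_pretrained(base, "<ADAPTER_MODEL_PATH>")
    merged = model.merge_and_unload()  # folds the LoRA weights into the base weights

    merged.save_pretrained("<MERGED_MODEL_PATH>")
    tokenizer = AutoTokenizer.from_pretrained("<ADAPTER_MODEL_PATH>")
    tokenizer.save_pretrained("<MERGED_MODEL_PATH>")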

Hello!

I have two concerns regarding your setup.

  1. Are you sure alpha=16 isn’t too much?

  2. When you merge, do you merge the adapter into the original non-quantized model or into a quantized version? My concern is that if the LoRA adapter is trained against the quantized base but merged into the original full-precision weights, the merged model’s output may not match what you saw during training. Quantizing to GGUF format then adds additional error on top of that, and the two stacked together might be enough to degrade the output considerably (see the sketch below).
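
One quick way to see how much the merge alone changes things, before any GGUF quantization enters the picture, is a check along these lines (paths, the prompt, and the device handling are all placeholders you would adapt):

    # Hypothetical sanity check (placeholder paths): compare logits from the model
    # as it was trained (4-bit base + LoRA adapter) with the merged fp16 model.
    # A large gap here points at the merge step rather than the GGUF conversion.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    base_path = "<BASE_MODEL_PATH>"
    adapter_path = "<ADAPTER_MODEL_PATH>"
    merged_path = "<MERGED_MODEL_PATH>"

    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    inputs = tokenizer("A prompt from your eval set", return_tensors="pt").to("cuda")

    def logits_of(model):
        with torch.no_grad():
            return model(**inputs).logits.float().cpu()

    # 1) The setup actually used during training: quantized base + adapter
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.float16)
    quant_base = AutoModelForCausalLM.from_pretrained(
        base_path, quantization_config=bnb, device_map="auto")
    logits_trained = logits_of(PeftModel.from_pretrained(quant_base, adapter_path))

    # 2) The merged full-precision model that gets converted to GGUF
    #    (if GPU memory is tight, run this half in a separate process)
    merged = AutoModelForCausalLM.from_pretrained(
        merged_path, torch_dtype=torch.float16, device_map="auto")
    logits_merged = logits_of(merged)

    print("mean abs logit diff:", (logits_trained - logits_merged).abs().mean().item())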

Thank you!


That’s because you quantized the model.

Quantization reduces the precision of the model’s parameters from floating-point representations to lower-bit representations.

Fewer bits means fewer distinct values are available to represent each weight, so every weight gets rounded to the nearest value the format can express. It’s like only being allowed to count from 1 to 10 instead of 1 to 100: simpler and smaller, but you lose detail. That’s why the quantized model takes up less space and runs faster on CPU, at the cost of some accuracy.
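
As a toy illustration of the rounding error this introduces (a simplified symmetric 8-bit scheme, not exactly what llama.cpp’s q8_0 does):

    # Toy example: quantize random "weights" to 8 bits and measure the error
    import torch

    torch.manual_seed(0)
    w = torch.randn(4096)                      # pretend these are model weights

    scale = w.abs().max() / 127.0              # one scale for the whole tensor
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    w_restored = q.float() * scale             # dequantize back to float

    err = (w - w_restored).abs()
    print(f"mean abs error: {err.mean():.6f}, max abs error: {err.max():.6f}")

Each weight individually only moves a little, but across billions of parameters those small errors add up, which is why aggressive quantization can noticeably change the model’s outputs.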
