Fine-tuned LLM conversion to GGUF - performance drop

Hi, I’m fine-tuning an LLM on my data using SFTTrainer, bitsandbytes quantization, and PEFT with the configs listed below. When I convert the model to GGUF for CPU inference, the model performance drops significantly. Any idea what the problem could be?

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
    )

I convert to GGUF in the following way. First, I merge the trained adapter with the base model. Then the merged model is converted to GGUF with llama.cpp’s convert.py script, using q8_0 quantization; I tested other types without success. I also tried the conversion with Unsloth, likewise without a positive result.

    python convert.py <MERGED_MODEL_PATH> \
      --outfile <OUTPUT_MODEL_NAME.gguf> \
      --outtype q8_0 \
      --vocab-dir <ADAPTER_MODEL_PATH>

Hello!

I have two concerns regarding your setup.

  1. Are you sure alpha=16 isn’t too much?

  2. When you merge, do you merge with the original non-quantized model or with a quantized version? My concern is that if the LoRA adapter is trained against the quantized weights but merged into the original full-precision model, the merged model’s output may no longer match what was trained. Quantizing to GGUF then adds additional error on top of that, and the two stacked together might be enough to degrade the output considerably (see the sketch below).
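A quick way to check where the degradation comes from (a minimal sketch with placeholder paths, using standard transformers/peft calls rather than your exact code) is to compare the model as it was trained, i.e. the 4-bit base plus the adapter, against the merged bf16 model on the same prompt, before any GGUF quantization is involved:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    base_path = "<BASE_MODEL_PATH>"        # placeholder paths
    adapter_path = "<ADAPTER_MODEL_PATH>"
    merged_path = "<MERGED_MODEL_PATH>"

    tok = AutoTokenizer.from_pretrained(base_path)
    inputs = tok("Put one of your evaluation prompts here", return_tensors="pt").to("cuda")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    # 1) the setup that was actually trained: 4-bit base + LoRA adapter
    trained = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained(
            base_path, quantization_config=bnb_config, device_map={"": 0}
        ),
        adapter_path,
    )
    print(tok.decode(trained.generate(**inputs, max_new_tokens=64)[0]))
    del trained
    torch.cuda.empty_cache()

    # 2) the merged bf16 model that later gets converted to GGUF
    merged = AutoModelForCausalLM.from_pretrained(
        merged_path, torch_dtype=torch.bfloat16, device_map={"": 0}
    )
    print(tok.decode(merged.generate(**inputs, max_new_tokens=64)[0]))

If the two outputs already diverge noticeably here, the damage comes from the merge rather than from the GGUF quantization.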

Thank you!


That’s because you quantized the model.

Quantization reduces the precision of the model’s parameters from floating-point representations to lower-bit representations.

Lower bit widths mean using fewer digits to represent each number. It’s like counting from 1 to 10 instead of 1 to 100: simpler and smaller, but also less precise, and that lost precision is what shows up as degraded output.
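As a rough illustration (this is not llama.cpp’s exact q8_0 scheme, just symmetric per-tensor rounding to 8 bits), you can measure the error that quantization introduces:

    import torch

    w = torch.randn(4096) * 0.02                # pretend these are model weights
    scale = w.abs().max() / 127                 # map the weight range onto int8
    w_q = (w / scale).round().clamp(-127, 127)  # the 8-bit representation
    w_deq = w_q * scale                         # what the quantized model computes with

    print("max abs error: ", (w - w_deq).abs().max().item())
    print("mean abs error:", (w - w_deq).abs().mean().item())

Every weight is off by a little, and those small errors accumulate across layers.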


Hi @iamz80, thank you for your insights and sorry for the late response.
ad 1. I tested different configurations of alpha and r, and it is OK. What is the best setup in your opinion? I have seen differing opinions.
ad 2. When I fine-tune, I load the base model with quantization:

    model = AutoPeftModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map={"": 0},
        trust_remote_code=True,
        use_cache=True,
        is_trainable=False,
    )

The adapter itself is then trained in bf16/tf32:

    args = TrainingArguments(
        output_dir=OUTPUT_MODEL_DIR,
        num_train_epochs=ft_parameters.EPOCHS,
        per_device_train_batch_size=ft_parameters.BATCH_SIZE,  # 4 --> 2 because of GPU memory
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=ft_parameters.ACCUMULATION_STEP,
        gradient_checkpointing=True,
        optim="paged_adamw_32bit",
        logging_steps=10,
        save_strategy="epoch",
        learning_rate=ft_parameters.LEARINIG_RATE,
        bf16=True,
        tf32=True,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
        evaluation_strategy="epoch" if ft_parameters.DO_EVAL else "no",
        report_to="none",
        disable_tqdm=False,  # disable tqdm since with packing values are incorrect
        full_determinism=True,
    )

Then, when I load the model for merging, I load it in bf16, as I thought that could only help. But your point is that this could actually deteriorate performance?

    model = AutoPeftModelForCausalLM.from_pretrained(
        lora_adapter_dir,
        device_map=device_map,
        torch_dtype=torch.bfloat16,
    )

    model = model.merge_and_unload()
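The merged model and tokenizer are then written out for convert.py, roughly like this (a sketch; the paths are the same placeholders as above):

    from transformers import AutoTokenizer

    # write the merged weights plus the tokenizer to the directory convert.py reads;
    # the tokenizer is taken from the adapter dir, matching --vocab-dir above
    model.save_pretrained("<MERGED_MODEL_PATH>")
    AutoTokenizer.from_pretrained(lora_adapter_dir).save_pretrained("<MERGED_MODEL_PATH>")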

So in your opinion, adding quantization_config=bnb_config to from_pretrained when merging could help? But then I would also quantize the adapter, which has been in higher precision so far…
And probably the best option is to take a bigger machine and fine-tune without any quantization, so that we avoid double quantization? I did test that quantization at GGUF conversion is needed, because without it the model has endless inference times…
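For reference, the no-quantization variant I have in mind would look roughly like this (a sketch with a placeholder path, keeping the same LoRA config, so the adapter is later merged into exactly the weights it was trained on):

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # load the base model in bf16, no bitsandbytes quantization
    base = AutoModelForCausalLM.from_pretrained(
        "<BASE_MODEL_PATH>",
        torch_dtype=torch.bfloat16,
        device_map={"": 0},
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # wrap with LoRA and pass to SFTTrainer as before
    model = get_peft_model(base, peft_config)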

Hi @meowman, if I do the conversion without quantization, the model is huge and the inference times are endless…