Fine-tuned LLM conversion to GGUF - performance drop

Hi, I’m fine-tuning an LLM on my data using SFTTrainer, bitsandbytes quantization, and PEFT with the configs listed below. When I convert the model to GGUF for CPU inference, its performance drops significantly. Any idea what the problem could be?

    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
    )
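
For completeness, the configs above go into the trainer roughly like this; the model ID, dataset file, and text column below are placeholders rather than my real setup, and some argument names may differ slightly between trl versions:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    base_model_id = "BASE_MODEL_ID"  # placeholder

    # QLoRA-style setup: 4-bit quantized base model + trainable LoRA adapter
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)

    dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        tokenizer=tokenizer,
        dataset_text_field="text",   # placeholder column name
        max_seq_length=1024,
    )
    trainer.train()
    trainer.model.save_pretrained("adapter")  # saves only the LoRA adapter weights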

I convert to GGUF in the following way. First, I merge the trained adapter with the base model. Then the merged model is converted to GGUF using llama.cpp’s ‘convert.py’ script with q8_0 quantization; I tested other quantization types without success. I also tried converting with Unsloth, likewise without a positive result.

python convert.py <MERGED_MODEL_PATH> \
  --outfile <OUTPUT_MODEL_NAME.gguf> \
  --outtype q8_0 \
  --vocab_dir <ADAPTER_MODEL_PATH>
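
For reference, the merge step itself looks roughly like this; it uses the same placeholder paths as in the command above, plus <BASE_MODEL_PATH> for the base model, and is a sketch rather than my exact script:

    # Merge the LoRA adapter into the base model before conversion (placeholder paths)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "<BASE_MODEL_PATH>",
        torch_dtype=torch.float16,
    )
    model = PeftModel.from_pretrained(base, "<ADAPTER_MODEL_PATH>")
    merged = model.merge_and_unload()  # folds the LoRA weights into the base weights

    merged.save_pretrained("<MERGED_MODEL_PATH>")
    tokenizer = AutoTokenizer.from_pretrained("<ADAPTER_MODEL_PATH>")
    tokenizer.save_pretrained("<MERGED_MODEL_PATH>")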

Hello!

I have two concerns regarding your setup.

  1. Are you sure alpha=16 isn’t too much?

  2. When you merge, do you merge the adapter into the original non-quantized model or into a quantized version? My concern is that if the LoRA adapter is trained against the quantized base but merged into the original full-precision weights, the merged model’s output may not match what you saw during training. Quantizing to GGUF format then adds additional error on top of that, and the two stacked together might be enough to degrade the output considerably (see the sketch below).
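
One quick way to see how much the merge alone changes things, before any GGUF quantization enters the picture, is a check along these lines (paths, the prompt, and the device handling are all placeholders you would adapt):

    # Hypothetical sanity check (placeholder paths): compare logits from the model
    # as it was trained (4-bit base + LoRA adapter) with the merged fp16 model.
    # A large gap here points at the merge step rather than the GGUF conversion.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    base_path = "<BASE_MODEL_PATH>"
    adapter_path = "<ADAPTER_MODEL_PATH>"
    merged_path = "<MERGED_MODEL_PATH>"

    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    inputs = tokenizer("A prompt from your eval set", return_tensors="pt").to("cuda")

    def logits_of(model):
        with torch.no_grad():
            return model(**inputs).logits.float().cpu()

    # 1) The setup actually used during training: quantized base + adapter
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.float16)
    quant_base = AutoModelForCausalLM.from_pretrained(
        base_path, quantization_config=bnb, device_map="auto")
    logits_trained = logits_of(PeftModel.from_pretrained(quant_base, adapter_path))

    # 2) The merged full-precision model that gets converted to GGUF
    #    (if GPU memory is tight, run this half in a separate process)
    merged = AutoModelForCausalLM.from_pretrained(
        merged_path, torch_dtype=torch.float16, device_map="auto")
    logits_merged = logits_of(merged)

    print("mean abs logit diff:", (logits_trained - logits_merged).abs().mean().item())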

Thank you!


That’s because you quantized the model.

Quantization reduces the precision of the model’s parameters from floating-point representations to lower-bit representations.

Fewer bits means fewer distinct values are available to represent each weight, so every weight gets rounded to the nearest value the format can express. It’s like only being allowed to count from 1 to 10 instead of 1 to 100: simpler and smaller, but you lose detail. That’s why the quantized model takes up less space and runs faster on CPU, at the cost of some accuracy.
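
As a toy illustration of the rounding error this introduces (a simplified symmetric 8-bit scheme, not exactly what llama.cpp’s q8_0 does):

    # Toy example: quantize random "weights" to 8 bits and measure the error
    import torch

    torch.manual_seed(0)
    w = torch.randn(4096)                      # pretend these are model weights

    scale = w.abs().max() / 127.0              # one scale for the whole tensor
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    w_restored = q.float() * scale             # dequantize back to float

    err = (w - w_restored).abs()
    print(f"mean abs error: {err.mean():.6f}, max abs error: {err.max():.6f}")

Each weight individually only moves a little, but across billions of parameters those small errors add up, which is why aggressive quantization can noticeably change the model’s outputs.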
