Inference after QLoRA fine-tuning

I’ve fine-tuned a model via QLoRA by following this notebook: Google Colab
And I pushed the adapter weights to the Hugging Face Hub.
When it comes time to predict with the base model + adapters, should I quantize the base model again (given the adapters were trained alongside a frozen, quantized base model)?
Or is it valid to load the base model unquantized, attach/merge the adapters as usual, and predict away?


## TRAINING
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(...)  # 4-bit quantization settings
model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)  # fit adapters alongside quantized base model
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(...)
trainer = SFTTrainer(model=model, peft_config=peft_config, ...)
trainer.train()
trainer.push_to_hub()  # pushes adapter weights only

## INFERENCE
from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel

peft_config = PeftConfig.from_pretrained(<hub_id>)
base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, device_map="auto")  # should I quantize here as I did when fitting the adapters with QLoRA?
peft_model = PeftModel.from_pretrained(base_model, <hub_id>)
model = peft_model.merge_and_unload()  # bake the adapter deltas into the base weights
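
For reference, the "quantize again at inference" option from the question would look roughly like this (a sketch only; the quantization settings shown are typical QLoRA values, not necessarily the ones from the notebook, and the adapters are applied on top rather than merged):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

peft_config = PeftConfig.from_pretrained(<hub_id>)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, <hub_id>)  # adapters stay un-merged on the 4-bit base
model.eval()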

This should help:

Unfortunately lots of the content is behind a paywall… Any other sources?

Apparently, PEFT v0.6.0, which was released a couple of days ago, can now merge LoRA weights into 8-bit base models. It's worth looking into to make sure the merge is done correctly, but it should be fine. Haven't tried it yet, though.
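
If that's the route you want, a minimal sketch of merging into an 8-bit base model could look like this (assuming PEFT >= 0.6.0 and bitsandbytes are installed; the model IDs are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in 8-bit via bitsandbytes, then attach the trained adapters.
base_model = AutoModelForCausalLM.from_pretrained(
    "<base_model_id>",   # placeholder for the original base model
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "<adapter_hub_id>")  # placeholder for the adapter repo
# With PEFT >= 0.6.0, merge_and_unload() should also work on the quantized base.
model = model.merge_and_unload()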


Hi @lewisbails, did you find an answer to your question? I have the same question right now.

If you fine-tuned a model using PEFT, then at inference time you can just use the AutoModelForCausalLM class, which will automatically load the base model + adapter weights for you (thanks to the PEFT integration in Transformers).

You can also pass load_in_4bit=True and device_map="auto" in order to do 4-bit inference and automatically place the model on the available GPUs.
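
Roughly, that could look like this (the repo ID is a placeholder for your adapter repo on the Hub, and it assumes the tokenizer was pushed alongside the adapter; otherwise load it from the base model):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Pointing at the adapter repo loads the base model and applies the adapter weights
# on top of it automatically, thanks to the PEFT integration in Transformers.
model = AutoModelForCausalLM.from_pretrained(
    "<adapter_hub_id>",  # placeholder: your adapter repo
    load_in_4bit=True,   # 4-bit inference via bitsandbytes
    device_map="auto",   # place the model on available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("<adapter_hub_id>")  # or the base model repo

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))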

I’ve made a notebook that goes over QLoRA fine-tuning of Mistral-7B, which also includes an inference section.


you’re a real one brother. :grin: