Inference after QLoRA fine-tuning

I’ve fine-tuned a model via QLoRA by following this notebook: Google Colab
And I pushed the adapter weights to the Hugging Face Hub.
When it comes time to predict with the base model+adapters, should I quantize the base model again (given the adapters were trained alongside a frozen quantized base model)?
Or is it valid to load the base model unquantized, attach/merge the adapters as usual, and predict away?

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(...)
model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)  # fit adapters alongside quantized base model
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(...)
trainer = SFTTrainer(model=model, peft_config=peft_config, ...)
trainer.push_to_hub()  # pushes adapter weights only

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel

peft_config = PeftConfig.from_pretrained(<hub_id>)
base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, device_map="auto")  # should I quantize here as I did when fitting the adapters with QLoRA?
peft_model = PeftModel.from_pretrained(base_model, <hub_id>)
model = peft_model.merge_and_unload()

This should help:

Unfortunately lots of the content is behind a paywall… Any other sources?

Apparently, PEFT v0.6.0, which was released a couple of days ago, can now merge LoRA weights into 8-bit base models. We should look into it more to make sure it's done right, but if we have faith it should be fine. Haven't tried it yet, though.


Hi @lewisbails, did you find an answer to your question? I have the same question right now.

If you fine-tuned a model using PEFT, then at inference time you can just use the AutoModelForCausalLM class, which will automatically load the base model + adapter weights for you (thanks to the PEFT integration in Transformers).

You can additionally pass load_in_4bit=True and device_map="auto" in order to do 4-bit inference and automatically place the model on the available GPUs.
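Putting the two points above together, a sketch could look like this (the adapter repo id is a placeholder; I'm also assuming a recent transformers version with the PEFT integration):

```python
# Hedged sketch: 4-bit inference from a PEFT adapter repo. Thanks to the
# PEFT integration in Transformers, pointing AutoModelForCausalLM at an
# adapter repo loads the base model and attaches the adapters automatically.
# "your-username/your-adapter-repo" is a placeholder hub id.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_for_4bit_inference(adapter_id: str):
    """Load base model + adapters in 4-bit, spread across available GPUs."""
    model = AutoModelForCausalLM.from_pretrained(
        adapter_id,
        load_in_4bit=True,   # quantize the base model for inference
        device_map="auto",   # place shards on available GPUs automatically
    )
    tokenizer = AutoTokenizer.from_pretrained(adapter_id)
    return model, tokenizer
```

Wrapping the loading in a function avoids downloading anything at import time; call it with your own hub id to run generation as usual.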

I’ve made a notebook that goes over QLoRA fine-tuning of Mistral-7B which also includes an inference section.


you’re a real one brother. :grin:

Hello Niels,

Thank you for sharing.

Did you ever compare QLoRA with QA-LoRA in terms of speed and accuracy at inference time?

I would greatly appreciate it if you could share your experiences.


Hey, can you help me?

There is an issue when running on my local Mac machine. When I change the model name from “microsoft/DialoGPT-small” to “meta-llama/Meta-Llama-3-8B”, the machine hangs with no response.

GitHub Link