Inference after QLoRA fine-tuning

If you fine-tuned a model using PEFT, then at inference time you can just use the AutoModelForCausalLM class: point it at the directory (or Hub repo) containing your adapter, and it will automatically load the base model + adapter weights for you (thanks to the PEFT integration in Transformers).
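A minimal sketch of what that looks like (the repo name below is just a placeholder for wherever you saved or pushed your adapter):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-username/mistral-7b-qlora-adapter" is a placeholder: any directory or
# Hub repo containing an adapter_config.json produced by PEFT will work.
# Transformers reads the adapter config, loads the base model it points to,
# and attaches the adapter weights on top.
model = AutoModelForCausalLM.from_pretrained("your-username/mistral-7b-qlora-adapter")

# This assumes you also saved/pushed the tokenizer alongside the adapter.
tokenizer = AutoTokenizer.from_pretrained("your-username/mistral-7b-qlora-adapter")
```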

You can also pass load_in_4bit=True and device_map="auto" in order to run inference in 4-bit and automatically place the model on the available GPU(s).
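Extending the sketch above with those two kwargs (again with a placeholder repo name, and assuming bitsandbytes is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "your-username/mistral-7b-qlora-adapter"  # placeholder

model = AutoModelForCausalLM.from_pretrained(
    adapter_id,
    load_in_4bit=True,   # quantize the base model to 4-bit via bitsandbytes
    device_map="auto",   # place the layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```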

I’ve made a notebook that goes over QLoRA fine-tuning of Mistral-7B, which also includes an inference section.