Inference after QLoRA fine-tuning

If you fine-tuned a model using PEFT, then at inference time you can just use the AutoModelForCausalLM class: point it at the directory (or Hub repo) containing your adapter, and it will automatically load the base model + adapter weights for you (thanks to the PEFT integration in Transformers).
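A minimal sketch of what that looks like (the repo name below is just a placeholder for wherever you saved or pushed your adapter):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-username/mistral-7b-qlora-adapter" is a placeholder: any directory or
# Hub repo containing an adapter_config.json produced by PEFT will work.
# Transformers reads the adapter config, loads the base model it points to,
# and attaches the adapter weights on top.
model = AutoModelForCausalLM.from_pretrained("your-username/mistral-7b-qlora-adapter")

# This assumes you also saved/pushed the tokenizer alongside the adapter.
tokenizer = AutoTokenizer.from_pretrained("your-username/mistral-7b-qlora-adapter")
```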

You can also pass load_in_4bit=True and device_map="auto" in order to run inference in 4-bit and automatically place the model on the available GPU(s).
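Extending the sketch above with those two kwargs (again with a placeholder repo name, and assuming bitsandbytes is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "your-username/mistral-7b-qlora-adapter"  # placeholder

model = AutoModelForCausalLM.from_pretrained(
    adapter_id,
    load_in_4bit=True,   # quantize the base model to 4-bit via bitsandbytes
    device_map="auto",   # place the layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```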

I’ve made a notebook that goes over QLoRA fine-tuning of Mistral-7B, which also includes an inference section.