If you fine-tuned a model using PEFT, then at inference time you can just use the AutoModelForCausalLM class, which will automatically load the base model + adapter weights for you (thanks to the PEFT integration in Transformers). You can additionally pass load_in_4bit=True and device_map="auto" in order to do 4-bit inference and automatically place the model on the available GPU(s).
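Here's a minimal sketch of what that could look like (the adapter repo name and base model id below are placeholders for your own, and it assumes peft, bitsandbytes and accelerate are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: a Hub repo or local directory containing your adapter_config.json + adapter weights
adapter_id = "your-username/mistral-7b-qlora-adapter"

# Thanks to the PEFT integration, pointing from_pretrained at the adapter
# loads the base model referenced in the adapter config and applies the
# adapter weights on top of it.
model = AutoModelForCausalLM.from_pretrained(
    adapter_id,
    load_in_4bit=True,   # quantize the base model to 4-bit for inference
    device_map="auto",   # automatically place the model on the available GPU(s)
)

# The tokenizer typically comes from the base model the adapter was trained on
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("The best thing about QLoRA is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```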
I’ve made a notebook that goes over QLoRA fine-tuning of Mistral-7B, which also includes an inference section.