For those still wondering, I found this answer helpful:
In other words, by training with PEFT, you haven't saved the whole model, but only the parameters that were updated by LoRA. This is called the "adapter model". In order to run inference on the whole fine-tuned model, you need to merge the adapter with the original base model. Here is a guide to do it on your local machine.
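For reference, here is a minimal sketch of that merge using peft's `PeftModel` and `merge_and_unload()`. The model IDs are placeholders, assume you'd swap in your own base model and adapter repo:

```python
# Minimal sketch: merge a LoRA adapter into its base model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Llama-2-7b-hf"      # placeholder: the base model the adapter was trained on
adapter_id = "your-username/your-lora-adapter"  # placeholder: your uploaded adapter repo

# Load the base model, then apply the adapter weights on top of it
base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Fold the LoRA weights into the base weights, leaving a plain model
merged_model = model.merge_and_unload()

# Save the full fine-tuned model (and tokenizer) for standalone inference
merged_model.save_pretrained("merged-model")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.save_pretrained("merged-model")
```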
However, if you want to use the Inference API directly on the adapter, there is a library made for exactly this purpose, called peft. In order to activate it, you need to specify it in the README (the model card) of your uploaded model.
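If I understood correctly, that means declaring the library (and the base model the adapter sits on) in the YAML metadata block at the top of the README, something like this (the base model here is just an example):

```yaml
---
library_name: peft
base_model: meta-llama/Llama-2-7b-hf
---
```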
Currently there is a problem: if the base model needs an authentication token (e.g. Llama 2), it won't work. This seems to be because the peft Inference API integration struggles to use your token to fetch the gated base model. Maybe I am doing something wrong, but in the end I didn't manage to make it work. If someone finds a solution, I would love to know.
Disclaimer: I am not an expert, just a beginner trying to save other people the hassle I went through to figure this out '-_-