I am encountering issues when trying to deploy my model with LoRA (Low-Rank Adaptation) adapters on Hugging Face Endpoints. Initially, I uploaded only the LoRA adapters to the Hugging Face model hub. However, when I attempt to deploy the model on the Endpoint, it keeps failing.
What I Have Tried:
Pushing LoRA Adapters to Hugging Face Hub:
I successfully uploaded only the LoRA adapters to Hugging Face. They work fine when I load the model locally in my development environment using the appropriate library and configuration.
Merging the Base Model with LoRA Adapters:
In an attempt to ensure compatibility for deployment, I merged the base model with the LoRA adapter and then pushed the merged model to Hugging Face. Unfortunately, the deployment still fails.
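For reference, the merge-and-push step I mean is roughly the following (a simplified sketch using PEFT's merge_and_unload; the repo IDs are placeholders, not my actual repositories):

```python
# Simplified sketch of merging LoRA adapters into the base model and pushing
# the result. Repo IDs are placeholders; the right Auto class depends on the model type.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "base-org/base-model",              # placeholder base model
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(base, "my-username/my-lora-adapters")  # placeholder adapter repo
merged = peft_model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.push_to_hub("my-username/my-merged-model")  # placeholder target repo
```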
I’m not sure whether you’re using Transformers or Diffusers, but either way, LoRA is handled by PEFT.
The Endpoints image ships with a somewhat old stable version of the libraries, and many of the quantization-related libraries are not included.
Because of that, it can easily fail with QLoRA or newer models. Why not try pinning newer library versions in a requirements.txt?
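For example, a requirements.txt in the repository root could look like this (the packages and minimum versions below are only illustrative, so adjust them to what your model actually needs):

```
# Illustrative pins only; pick versions that match your model and training stack.
transformers>=4.45.0
peft>=0.13.0
accelerate>=0.34.0
bitsandbytes>=0.44.0
safetensors>=0.4.5
```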
The base model I’m using is: unsloth/Llama-3.2-11B-Vision-Instruct
I fine-tuned it using Unsloth with LoRA adapters
I pushed only the LoRA adapters to Hugging Face initially
The LoRA adapters work perfectly on Colab when I load them with the base model manually (roughly as in the sketch below)
The issue only happens when I try to deploy it as an Endpoint on Hugging Face: it keeps failing to start, whether I deploy the LoRA-only upload or the merged model I pushed later
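For context, the Colab loading that works for me is roughly this (a sketch assuming the adapters were saved with Unsloth; the adapter repo name is a placeholder):

```python
# Sketch of loading the LoRA adapters on top of the base model with Unsloth.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "my-username/llama-3.2-11b-vision-lora",  # placeholder repo containing only the LoRA adapters
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # put the model in inference mode
```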
So I’m a bit stuck figuring out what’s going wrong during deployment.
I fine-tuned the vision model unsloth/Llama-3.2-11B-Vision-Instruct using Unsloth, and initially pushed only the LoRA adapters to Hugging Face — it worked fine locally on Colab.
Later, I merged the LoRA adapters with the base model and pushed the full model to the Hub. However, the deployment on Hugging Face Endpoints still fails.
Do you suggest that I need to push the merged fine-tuned model in 16-bit .safetensors format for it to work properly with TGI?
That’s right. It doesn’t strictly need to be 16-bit, but I personally recommend torch.bfloat16. Otherwise it looks like you need to set environment variables explicitly to make the Endpoint use the pickle files (.bin). Alternatively, you could use a custom handler or a custom container to change the Endpoint software or OS itself, but that would be quite a struggle.
Alternatively, you could load the merged model with Transformers’ AutoModelForCausalLM.from_pretrained and re-save it with save_pretrained. Well, it amounts to the same thing…
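As a rough sketch of that route (repo IDs are placeholders, and for a vision model you may need the matching class, e.g. MllamaForConditionalGeneration, instead of AutoModelForCausalLM):

```python
# Sketch: reload the merged model in bfloat16 and re-save/push it as safetensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "my-username/my-merged-model",      # placeholder merged-model repo
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("my-username/my-merged-model")

model.save_pretrained("merged-bf16", safe_serialization=True)  # writes .safetensors shards
tokenizer.save_pretrained("merged-bf16")

model.push_to_hub("my-username/my-merged-model-bf16", safe_serialization=True)  # placeholder target repo
tokenizer.push_to_hub("my-username/my-merged-model-bf16")
```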
Thank you for your answer.
So basically, after I fine-tune a vision model using Unsloth, I need to push it back to Hugging Face as a 16-bit model, not just the LoRA adapters.
To ensure compatibility with Hugging Face Endpoints, should I save my fine-tuned vision model in float16 format by selecting merged_16bit and then upload the full model (not just LoRA adapters) using push_to_hub_merged()? This avoids deployment issues, correct?
Correct. If you save it in the safetensors format using that method, it will be ideal in terms of compatibility. LoRA alone may work if you set the base model appropriately, but it will be less trouble if you upload the whole merged model.
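For example, roughly like this (a sketch assuming the model and tokenizer objects from your Unsloth training session; the target repo name is a placeholder):

```python
# Sketch: push the merged 16-bit model directly from the Unsloth session.
# "model" and "tokenizer" are the FastVisionModel objects you trained with;
# pass token=... if you are not already logged in to the Hub.
model.push_to_hub_merged(
    "my-username/llama-3.2-11b-vision-merged",  # placeholder target repo
    tokenizer,
    save_method="merged_16bit",
)
```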