Issue with Deploying LoRA-adapted Model on Hugging Face Endpoint

I am encountering issues when trying to deploy my model with LoRA (Low-Rank Adaptation) adapters on Hugging Face Endpoints. Initially, I uploaded only the LoRA adapters to the Hugging Face model hub. However, when I attempt to deploy the model on the Endpoint, it keeps failing.

What I Have Tried:

  1. Pushing LoRA Adapters to Hugging Face Hub:
    I successfully uploaded only the LoRA adapters to Hugging Face. They work fine when I load the model locally in my development environment using the appropriate library and configuration.
  2. Merging the Base Model with LoRA Adapters:
    In an attempt to ensure compatibility for deployment, I merged the base model with the LoRA adapters and then pushed the merged model to Hugging Face (see the sketch after this list). Unfortunately, the deployment still fails.
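
Roughly the kind of workflow I mean for step 2 (a simplified sketch using the PEFT library; the repo ids are placeholders, not my real ones, and you would swap the Auto class for the one matching your model type):

```python
# Sketch: merge LoRA adapters into the base model and push the result to the Hub.
# Both repo ids below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "base-org/base-model"              # placeholder base model repo
adapter_id = "your-username/lora-adapters"   # placeholder adapter repo

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_id)

# Fold the LoRA weights into the base weights so no adapter config is needed at inference time.
merged = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_id)
merged.push_to_hub("your-username/merged-model", safe_serialization=True)
tokenizer.push_to_hub("your-username/merged-model")
```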

I’m not sure whether you’re using Transformers or Diffusers, but either way, LoRA is handled by PEFT.
The Endpoint image ships a somewhat old stable version of the libraries, and many of the quantization-related libraries are not included.

Because of that, it can easily fail with QLoRA or newer models. Why not try pinning newer library versions in requirements.txt?
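
For example, something like this in the model repo’s requirements.txt (the version pins are only illustrative, and it only helps if the Endpoint container actually installs from a requirements.txt in the repo):

```
transformers>=4.45.0
peft>=0.13.0
accelerate>=0.34.0
bitsandbytes>=0.44.0
```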

Oh… If that’s the case, then step 2 failing is strange…
I wonder what’s going on. Does it work locally, or is it failing to apply LoRA?

Thanks for the reply!

Just to clarify:

  • The base model I’m using is: unsloth/Llama-3.2-11B-Vision-Instruct
  • I fine-tuned it using Unsloth with LoRA adapters
  • I pushed only the LoRA adapters to Hugging Face initially
  • The LoRA adapters work perfectly on Colab when I load them with the base model manually (see the sketch after this list)
  • The issue only happens when I try to deploy it as an Endpoint on Hugging Face: it keeps failing to start, both with the LoRA-only upload and with the merged model (LoRA folded into the base) that I pushed later
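
Roughly how I load it on Colab (a simplified sketch; the adapter repo id is a placeholder, and it assumes a recent transformers release with the Mllama classes plus PEFT):

```python
# Sketch: load the base vision model, then attach the LoRA adapters.
# The adapter repo id is a placeholder.
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import PeftModel

base_id = "unsloth/Llama-3.2-11B-Vision-Instruct"
adapter_id = "your-username/llama-3.2-11b-vision-lora"  # placeholder

model = MllamaForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)
processor = AutoProcessor.from_pretrained(base_id)
```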

So I’m a bit stuck figuring out what’s going wrong during deployment.

Any ideas or suggestions are super appreciated!


I fine-tuned it using Unsloth with LoRA adapters

Hugging Face TGI (the engine behind Endpoints) basically supports only safetensors weights.
So could this be the issue?
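
A quick way to check which weight formats a repo actually contains, using huggingface_hub (the repo id is a placeholder):

```python
# Check whether a Hub repo ships safetensors or only pickle (.bin) weights.
from huggingface_hub import list_repo_files

files = list_repo_files("your-username/merged-model")  # placeholder repo id
print("safetensors:", any(f.endswith(".safetensors") for f in files))
print("pickle .bin:", any(f.endswith(".bin") for f in files))
```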

I fine-tuned the vision model unsloth/Llama-3.2-11B-Vision-Instruct using Unsloth, and initially pushed only the LoRA adapters to Hugging Face — it worked fine locally on Colab.

Later, I merged the LoRA adapters with the base model and pushed the full model to the Hub. However, the deployment on Hugging Face Endpoints still fails.

Do you suggest that I need to push the merged fine-tuned model in 16-bit .safetensors format for it to work properly with TGI?

Thanks in advance!


Do you suggest that I need to push the merged fine-tuned model in 16-bit .safetensors format for it to work properly with TGI?

That’s right. There is no particular need for it to be 16-bit, but I personally recommend torch.bfloat16. Otherwise it looks like you need to set environment variables explicitly to allow the pickle (.bin) files. Alternatively, you could use a custom handler or custom container to change the Endpoint software or OS itself, but that would be quite a struggle.

Alternatively, you could use Transformers’ AutoModelForCausalLM.from_pretrained and save_pretrained to re-save the merged model. Well, it amounts to the same thing…
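
Something along these lines (a sketch; for the vision model you would swap AutoModelForCausalLM for the matching vision class such as MllamaForConditionalGeneration, and the repo ids are placeholders):

```python
# Sketch: reload a merged checkpoint in bfloat16 and re-save it as safetensors.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-username/merged-model",  # placeholder repo id
    torch_dtype=torch.bfloat16,
)
model.save_pretrained("merged-bf16", safe_serialization=True)
# ...or push it straight to the Hub:
model.push_to_hub("your-username/merged-model-bf16", safe_serialization=True)
```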

Thank you for your answer.
So basically, after I fine-tune a vision model using Unsloth, I need to push it back to Hugging Face as a 16-bit model, not just the LoRA adapters.
To ensure compatibility with Hugging Face Endpoints, should I save my fine-tuned vision model in float16 format by selecting merged_16bit and then upload the full model (not just LoRA adapters) using push_to_hub_merged()? This avoids deployment issues, correct?


should I save my fine-tuned vision model in float16 format by selecting merged_16bit and then upload the full model (not just LoRA adapters) using push_to_hub_merged()? This avoids deployment issues, correct?

Correct. If you save it in the safetensors format using that method, it will be ideal in terms of compatibility. LoRA alone may work if you set the base model appropriately, but having the whole merged model should be less troublesome.
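
Roughly what that call looks like, as far as I know of Unsloth’s API (the target repo id is a placeholder, and the tokenizer argument should be whatever tokenizer/processor object your Unsloth run returned):

```python
# Sketch: push the merged 16-bit model from an Unsloth fine-tune to the Hub.
model.push_to_hub_merged(
    "your-username/llama-3.2-11b-vision-merged",  # placeholder repo id
    tokenizer,
    save_method="merged_16bit",
)
```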

Thank you for your answer. I really appreciate it, it was helpful.
