I am encountering issues when trying to deploy my model with LoRA (Low-Rank Adaptation) adapters on Hugging Face Endpoints. Initially, I uploaded only the LoRA adapters to the Hugging Face model hub. However, when I attempt to deploy the model on the Endpoint, it keeps failing.
What I Have Tried:
Pushing LoRA Adapters to Hugging Face Hub:
I successfully uploaded only the LoRA adapters to Hugging Face. They work fine when I load the model locally in my development environment using the appropriate library and configuration.
Merging the Base Model with LoRA Adapters:
In an attempt to ensure compatibility for deployment, I merged the base model with the LoRA adapter and then pushed the merged model to Hugging Face. Unfortunately, the deployment still fails.
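For reference, the merge-and-push step I mean is roughly the following (a simplified sketch using PEFT's merge_and_unload; the repo IDs are placeholders, not my actual repositories):

```python
# Simplified sketch of merging LoRA adapters into the base model and pushing
# the result. Repo IDs are placeholders; the right Auto class depends on the model type.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "base-org/base-model",              # placeholder base model
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(base, "my-username/my-lora-adapters")  # placeholder adapter repo
merged = peft_model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.push_to_hub("my-username/my-merged-model")  # placeholder target repo
```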
I’m not sure whether you’re using Transformers or Diffusers, but either way, LoRA is handled by PEFT.
The Endpoints image ships with a somewhat old stable version of the libraries, and many of the quantization-related libraries are not included.
Because of that, it can easily fail with QLoRA or newer models. Why not try pinning newer library versions in a requirements.txt?
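For example, a requirements.txt in the repository root could look like this (the packages and minimum versions below are only illustrative, so adjust them to what your model actually needs):

```
# Illustrative pins only; pick versions that match your model and training stack.
transformers>=4.45.0
peft>=0.13.0
accelerate>=0.34.0
bitsandbytes>=0.44.0
safetensors>=0.4.5
```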
The base model I’m using is: unsloth/Llama-3.2-11B-Vision-Instruct
I fine-tuned it using Unsloth with LoRA adapters
I pushed only the LoRA adapters to Hugging Face initially
The LoRA adapters work perfectly on Colab when I load them with the base model manually (roughly as in the sketch below)
The issue only happens when I try to deploy it as an Endpoint on Hugging Face: it keeps failing to start, whether I deploy the LoRA-only upload or the merged model I pushed later
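For context, the Colab loading that works for me is roughly this (a sketch assuming the adapters were saved with Unsloth; the adapter repo name is a placeholder):

```python
# Sketch of loading the LoRA adapters on top of the base model with Unsloth.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "my-username/llama-3.2-11b-vision-lora",  # placeholder repo containing only the LoRA adapters
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # put the model in inference mode
```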
So I’m a bit stuck figuring out what’s going wrong during deployment.
I fine-tuned the vision model unsloth/Llama-3.2-11B-Vision-Instruct using Unsloth, and initially pushed only the LoRA adapters to Hugging Face — it worked fine locally on Colab.
Later, I merged the LoRA adapters with the base model and pushed the full model to the Hub. However, the deployment on Hugging Face Endpoints still fails.
Do you suggest that I need to push the merged fine-tuned model in 16-bit .safetensors format for it to work properly with TGI?
That’s right. It doesn’t strictly need to be 16-bit, but I personally recommend torch.bfloat16. Otherwise it looks like you need to set environment variables explicitly to make the Endpoint use the pickle files (.bin). Alternatively, you could use a custom handler or a custom container to change the Endpoint software or OS itself, but that would be quite a struggle.
Alternatively, you could load the merged model with Transformers’ AutoModelForCausalLM.from_pretrained and re-save it with save_pretrained. Well, it amounts to the same thing…
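As a rough sketch of that route (repo IDs are placeholders, and for a vision model you may need the matching class, e.g. MllamaForConditionalGeneration, instead of AutoModelForCausalLM):

```python
# Sketch: reload the merged model in bfloat16 and re-save/push it as safetensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "my-username/my-merged-model",      # placeholder merged-model repo
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("my-username/my-merged-model")

model.save_pretrained("merged-bf16", safe_serialization=True)  # writes .safetensors shards
tokenizer.save_pretrained("merged-bf16")

model.push_to_hub("my-username/my-merged-model-bf16", safe_serialization=True)  # placeholder target repo
tokenizer.push_to_hub("my-username/my-merged-model-bf16")
```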
Thank you for your answer.
So basically, after I fine-tune a vision model using Unsloth, I need to push it back to Hugging Face as a 16-bit model, not just the LoRA adapters.
To ensure compatibility with Hugging Face Endpoints, should I save my fine-tuned vision model in float16 format by selecting merged_16bit and then upload the full model (not just LoRA adapters) using push_to_hub_merged()? This avoids deployment issues, correct?
Correct. If you save it in the safetensors format using that method, it will be ideal in terms of compatibility. LoRA alone may work if you set the base model appropriately, but it will be less trouble if you upload the whole merged model.
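For example, roughly like this (a sketch assuming the model and tokenizer objects from your Unsloth training session; the target repo name is a placeholder):

```python
# Sketch: push the merged 16-bit model directly from the Unsloth session.
# "model" and "tokenizer" are the FastVisionModel objects you trained with;
# pass token=... if you are not already logged in to the Hub.
model.push_to_hub_merged(
    "my-username/llama-3.2-11b-vision-merged",  # placeholder target repo
    tokenizer,
    save_method="merged_16bit",
)
```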