Using Text Generation Inference with LoRA adapter

I just trained my first LoRA model, but I believe I might have missed something.
After training a Flan-T5-Large model, I tested it and it was working perfectly.
I decided that I wanted to test its deployment using TGI.
I managed to deploy the base Flan-T5-Large model from Google using TGI, as it was pretty straightforward. But when I came to test the LoRA model I had trained, using the pipeline, it underperformed heavily.
I noticed that when I trained my LoRA model, I did not get a “config.json” file; I got an “adapter_config.json” file instead. I understood that what I basically had was only the adapter.
I don’t know if that is one of the reasons. After training I did more research on LoRA and noticed that the documentation mentioned “merging” and “loading” the adapter with the base model, which I did not do at the start. I basically trained and got several checkpoints, one for each epoch, tested the checkpoint that had the best metrics, and pushed it to my private hub. These are the files that I have pushed to my hub:

  • gitattributes
  • README.md
  • adapter_config.json
  • adapter_model.safetensors
  • special_tokens_map.json
  • spiece.model
  • tokenizer.json
  • tokenizer_config.json

While trying to avoid re-training, how can I load the LoRA model properly for testing with the pipeline, so that I can then also deploy it on TGI?

Hi,

Thanks to the PEFT integration in the Transformers library, the base model + adapter weights will automatically be loaded. The weights of the base model (such as Flan-T5-large in your case) can be loaded since the adapter_config.json contains a base_model_name_or_path key.
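
As a minimal sketch of testing this locally (assuming peft is installed; "your-username/flan-t5-large-lora" is a placeholder for your private adapter repo):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# With peft installed, Transformers reads base_model_name_or_path from
# adapter_config.json, downloads google/flan-t5-large and loads the adapter on top.
adapter_repo = "your-username/flan-t5-large-lora"  # placeholder repo name
model = AutoModelForSeq2SeqLM.from_pretrained(adapter_repo)
tokenizer = AutoTokenizer.from_pretrained(adapter_repo)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
print(pipe("Summarize: the adapter weights are applied on top of the base model."))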

For now, TGI only supports deploying models trained with LoRA after the adapter has been merged into the base model by calling the merge_and_unload method: curious about the plans for supporting PEFT and LoRa. · Issue #482 · huggingface/text-generation-inference · GitHub.
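
If you want a standalone checkpoint (including a config.json) that this setup can serve, a rough sketch of the merge, using placeholder repo names, could be:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter (placeholder adapter repo name)
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(base, "your-username/flan-t5-large-lora")

# Fold the adapter weights into the base weights and drop the PEFT wrapper
merged = model.merge_and_unload()

# Save a regular Transformers checkpoint that TGI can serve like any other model
merged.save_pretrained("flan-t5-large-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/flan-t5-large-lora")
tokenizer.save_pretrained("flan-t5-large-merged")

You can then push the merged folder to the Hub and point TGI at that repo.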

Update: TGI now also supports inference without merging your LoRA into the base model (you can serve the same base model with hundreds of different LoRAs!): TGI Multi-LoRA: Deploy Once, Serve 30 Models.
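
As a rough client-side sketch of that multi-LoRA setup (assuming the server was started with the adapter list as described in the blog post; the endpoint URL and adapter id below are placeholders):

import requests

# Placeholder adapter id; the server must have been launched with this adapter loaded
payload = {
    "inputs": "Summarize: multi-LoRA serving keeps one base model in memory.",
    "parameters": {"adapter_id": "my-org/my-lora-adapter", "max_new_tokens": 64},
}
response = requests.post("http://localhost:8080/generate", json=payload)
print(response.json())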

Hi @nielsr ,

I have been trying to fine-tune Idefics2 on my custom DocVQA dataset, and now I am trying to use TGI for inference, which is how I came across this discussion. Can you let me know what exactly the difference is between deploying models trained with LoRA by calling the merge_and_unload method, using PeftModel.from_pretrained, and using model.add_weighted_adapter?

Thanks

Hi,

If you call merge_and_unload, the LoRA adapters are merged into your base model, so you end up with a model just like the ones available in the Transformers library.

If you use PeftModel.from_pretrained, the base model and LoRA adapters are loaded separately (they are not merged). You can verify this by doing:

for name, param in model.named_parameters():
    print(name, param.shape)

You will see some lora_A and lora_B parameters, so they are still separate and not merged into the parameters of the base model. There is no need to call model.add_weighted_adapter, because PeftModel.from_pretrained will already load your adapters from a folder or repo on the Hub; add_weighted_adapter is only meant for combining several adapters into a new, weighted one.
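
As a small end-to-end sketch of this second option (placeholder repo names, using the Flan-T5 example from earlier in this thread), loading the adapter separately and listing the un-merged LoRA parameters could look like:

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Placeholder repo names: swap in your own base model and adapter
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(base, "your-username/flan-t5-large-lora")

# The lora_A / lora_B matrices show up as separate parameters,
# which confirms nothing has been merged into the base weights yet
for name, param in model.named_parameters():
    if "lora_" in name:
        print(name, param.shape)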