Using Text Generation Inference with LoRA adapter

I just trained my first LoRA model, but I believe I might have missed something.
After training a Flan-T5-Large model, I tested it and it was working perfectly.
I decided that I wanted to test its deployment using TGI.
I managed to deploy the base Flan-T5-Large model from Google using TGI, as it was pretty straightforward. But when I came to test the LoRA model I had trained, using the pipeline, it underperformed heavily.
I noticed that when I trained my LoRA model, I did not get a “config.json” file; I got an “adapter_config.json” file instead. I understood that what I basically had was only the adapter.
I don’t know if that is one of the reasons. After training I did more research on LoRA and noticed that the documentation mentioned “merging” and “loading” the adapter with the base model, which I did not do at the start. I basically trained and got several checkpoints, one for each epoch, tested the checkpoint that had the best metrics, and pushed it to my private hub. These are the files that I have pushed to my hub:

  • gitattributes
  • README.md
  • adapter_config.json
  • adapter_model.safetensors
  • special_tokens_map.json
  • spiece.model
  • tokenizer.json
  • tokenizer_config.json

While trying to avoid re-training, how can I load the LoRA model properly for testing with the pipeline, so that I can then also deploy it on TGI?

Hi,

Thanks to the PEFT integration in the Transformers library, the base model + adapter weights will automatically be loaded. The weights of the base model (such as Flan-T5-large in your case) can be loaded since the adapter_config.json contains a base_model_name_or_path key.
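
As a minimal sketch of testing this locally (assuming peft is installed; "your-username/flan-t5-large-lora" is a placeholder for your private adapter repo):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# With peft installed, Transformers reads base_model_name_or_path from
# adapter_config.json, downloads google/flan-t5-large and loads the adapter on top.
adapter_repo = "your-username/flan-t5-large-lora"  # placeholder repo name
model = AutoModelForSeq2SeqLM.from_pretrained(adapter_repo)
tokenizer = AutoTokenizer.from_pretrained(adapter_repo)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
print(pipe("Summarize: the adapter weights are applied on top of the base model."))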

For now, TGI only supports deploying models trained with LoRA after the adapter has been merged into the base model by calling the merge_and_unload method: curious about the plans for supporting PEFT and LoRa. · Issue #482 · huggingface/text-generation-inference · GitHub.
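
If you want a standalone checkpoint (including a config.json) that this setup can serve, a rough sketch of the merge, using placeholder repo names, could be:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter (placeholder adapter repo name)
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(base, "your-username/flan-t5-large-lora")

# Fold the adapter weights into the base weights and drop the PEFT wrapper
merged = model.merge_and_unload()

# Save a regular Transformers checkpoint that TGI can serve like any other model
merged.save_pretrained("flan-t5-large-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/flan-t5-large-lora")
tokenizer.save_pretrained("flan-t5-large-merged")

You can then push the merged folder to the Hub and point TGI at that repo.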

Update: TGI now also supports inference without merging your LoRA into the base model (you can serve the same base model with hundreds of different LoRAs!): TGI Multi-LoRA: Deploy Once, Serve 30 Models.
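
As a rough client-side sketch of that multi-LoRA setup (assuming the server was started with the adapter list as described in the blog post; the endpoint URL and adapter id below are placeholders):

import requests

# Placeholder adapter id; the server must have been launched with this adapter loaded
payload = {
    "inputs": "Summarize: multi-LoRA serving keeps one base model in memory.",
    "parameters": {"adapter_id": "my-org/my-lora-adapter", "max_new_tokens": 64},
}
response = requests.post("http://localhost:8080/generate", json=payload)
print(response.json())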

Hi @nielsr ,

I have been trying to fine-tune Idefics2 on my custom DocVQA dataset, and now I am trying to use TGI for inference, which is how I came across this discussion. Can you let me know what exactly the difference is between deploying models trained with LoRA by calling the merge_and_unload method, using PeftModel.from_pretrained, and using model.add_weighted_adapter?

Thanks

Hi,

If you call merge_and_unload, the LoRA adapters are merged into your base model, so you end up with a model just like the ones available in the Transformers library.

If you use PeftModel.from_pretrained, the base model and LoRA adapters are loaded separately (they are not merged). You can verify this by doing:

for name, param in model.named_parameters():
    print(name, param.shape)

You will see some lora_A and lora_B parameters, so they are still separate and not merged into the parameters of the base model. There is no need to call model.add_weighted_adapter, because PeftModel.from_pretrained will already load your adapters from a folder or repo on the Hub; add_weighted_adapter is only meant for combining several adapters into a new, weighted one.
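
As a small end-to-end sketch of this second option (placeholder repo names, using the Flan-T5 example from earlier in this thread), loading the adapter separately and listing the un-merged LoRA parameters could look like:

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Placeholder repo names: swap in your own base model and adapter
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(base, "your-username/flan-t5-large-lora")

# The lora_A / lora_B matrices show up as separate parameters,
# which confirms nothing has been merged into the base weights yet
for name, param in model.named_parameters():
    if "lora_" in name:
        print(name, param.shape)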