Using Text Generation Inference with a LoRA adapter

I just trained my first LoRA model, but I believe I might have missed something.
After fine-tuning a Flan-T5-Large model, I tested it and it worked perfectly.
I then decided to test its deployment using TGI.
Deploying the base Flan-T5-Large model from Google on TGI was straightforward. But when I tested my LoRA model using pipeline, it underperformed heavily.
I noticed that training my LoRA model did not produce a “config.json” file, only an “adapter_config.json” file, so I understood that what I had was just the adapter.
I don’t know if that is one of the reasons. After training I did more research on LoRA and noticed that the documentation mentions “merging” and “loading” the adapter with the base model, which I did not do at the start. I simply trained, got a checkpoint for each epoch, tested the checkpoint with the best metrics, and pushed it to my private hub. These are the files I pushed to my hub:

  • gitattributes
  • README.md
  • adapter_config.json
  • adapter_model.safetensors
  • special_tokens_map.json
  • spiece.model
  • tokenizer.json
  • tokenizer_config.json

Without re-training, how can I properly load and test the LoRA model using pipeline, so that I can then deploy it on TGI?

Hi,

Thanks to the PEFT integration in the Transformers library, the base model and adapter weights are loaded automatically. The base model weights (Flan-T5-Large in your case) can be fetched because adapter_config.json contains a base_model_name_or_path key.
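
For example, loading your adapter repo directly should work roughly like this (a minimal sketch; the repo name is a placeholder for your private repo, and it assumes peft is installed alongside transformers):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Placeholder name for your private adapter repo.
adapter_repo = "your-username/flan-t5-large-lora"

# With peft installed, from_pretrained detects adapter_config.json,
# downloads the base model listed under base_model_name_or_path
# (google/flan-t5-large) and loads the adapter weights on top of it.
model = AutoModelForSeq2SeqLM.from_pretrained(adapter_repo)
tokenizer = AutoTokenizer.from_pretrained(adapter_repo)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
print(pipe("Translate English to German: How old are you?"))
```

If the outputs look right here but were bad before, the earlier underperformance was likely because the adapter was never applied on top of the base model.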

TGI for now only supports deploying models trained with LoRA after the adapter has been merged into the base model by calling the merge_and_unload method; see: curious about the plans for supporting PEFT and LoRa. · Issue #482 · huggingface/text-generation-inference · GitHub
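
Concretely, merging and pushing a standalone model could look like this (a sketch; the repo names are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

base_id = "google/flan-t5-large"
adapter_repo = "your-username/flan-t5-large-lora"   # placeholder
merged_repo = "your-username/flan-t5-large-merged"  # placeholder

# Load the base model, attach the LoRA adapter, then fold the low-rank
# updates into the base weights. merge_and_unload returns a plain
# Transformers model with a regular config.json, which TGI can serve.
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_repo)
merged = model.merge_and_unload()

merged.push_to_hub(merged_repo)
AutoTokenizer.from_pretrained(adapter_repo).push_to_hub(merged_repo)
```

The merged repo then contains a regular config.json and the full model weights, so you can point TGI's --model-id at it exactly as you did for the base model.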

Hi @nielsr ,

I have been trying to fine-tune Idefics2 on my custom DocVQA dataset. Now I am trying to use TGI for inference and came across this discussion. Can you let me know what exactly the difference is between deploying models trained with LoRA by calling the merge_and_unload method, using PeftModel.from_pretrained, and using model.add_weighted_adapter?

Thanks
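
For reference, the three calls do different things. A rough sketch of all three, reusing the Flan-T5 setup from earlier in this thread for simplicity (adapter repo names are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# PeftModel.from_pretrained keeps the adapter separate: the base model
# stays untouched and the small LoRA layers run at inference time.
model = PeftModel.from_pretrained(base, "your-username/adapter-a")

# add_weighted_adapter only matters when several adapters are loaded:
# it builds a new adapter as a weighted combination of existing ones.
model.load_adapter("your-username/adapter-b", adapter_name="b")
model.add_weighted_adapter(
    adapters=["default", "b"],
    weights=[0.5, 0.5],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")

# merge_and_unload folds the active adapter into the base weights and
# returns a plain Transformers model, which is the form TGI can serve.
# After this call there is no separate adapter left.
merged = model.merge_and_unload()
```

So for TGI deployment specifically, merge_and_unload is the step that matters; the other two keep the adapter as a separate module.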