I have finetuned a Mistral model for text classification and want to deploy it. I also want to deploy the Mistral base model for generation. Something like a FastAPI app with two endpoints: one for text generation with the Mistral base model and one for classification with my finetuned version. But I want to load the base model only once and switch between the two by turning the adapter on and off. Is that possible? Is there an example out there?
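To make it concrete, this is roughly the switching mechanism I have in mind, sketched with PEFT's `disable_adapter` (the adapter path is a placeholder, and I'm assuming this is the right API for it):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "my-adapter")  # placeholder adapter path

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)

# adapter active: finetuned behaviour
finetuned_out = model.generate(**inputs, max_new_tokens=32)

# adapter temporarily disabled: plain base-model behaviour
with model.disable_adapter():
    base_out = model.generate(**inputs, max_new_tokens=32)
```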
TGI (text-generation-inference) is a framework aimed at deploying LLMs/multimodal models. Besides that, there's also vLLM, which supports multi-LoRA serving: Using LoRA adapters — vLLM.
Both TGI and vLLM offer OpenAI API compatibility, which means that you can call the models in the same way as you would call OpenAI models.
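For example, a vLLM server started with LoRA support can serve the base model and an adapter under different model names, and you pick between them per request; something along these lines should work (the adapter name and path are placeholders):

```python
from openai import OpenAI

# Server started beforehand with (adapter name/path are placeholders):
#   vllm serve mistralai/Mistral-7B-v0.1 --enable-lora \
#       --lora-modules gen-adapter=/path/to/adapter
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Base model: request it by the base model name.
base_out = client.completions.create(
    model="mistralai/Mistral-7B-v0.1", prompt="Hello", max_tokens=64
)

# LoRA adapter: request the adapter name instead.
lora_out = client.completions.create(
    model="gen-adapter", prompt="Hello", max_tokens=64
)
print(base_out.choices[0].text)
print(lora_out.choices[0].text)
```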
I have a finetuned AutoModelForSequenceClassification version with the LoRA task type ‘SEQ_CLS’.
I have a second finetuned AutoModelForCausalLM version with the LoRA task type ‘CAUSAL_LM’.
I want to create an API with three endpoints:
1. Mistral base model → generate (no LoRA)
2. Finetuned classification model (LoRA)
3. Finetuned generation model (LoRA)
I want to deploy the base model once and then switch the adapters on/off depending on the request. Combining 1. and 3. is, I think, simple and works. But I am not sure whether the classification LoRA can be incorporated here, since SEQ_CLS also saves a classification head that the causal-LM base does not have (see the sketch below).
(Context: I have an old server NVIDIA card with 24 GB of VRAM.)
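For reference, this is roughly what I have sketched so far. It keeps two copies of the backbone, one causal-LM and one sequence-classification, because I don't see how to hang the SEQ_CLS head off the causal-LM base. The adapter paths, `num_labels`, and the 4-bit loading (to squeeze both copies into 24 GB) are just my assumptions:

```python
import torch
from fastapi import FastAPI
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)

BASE = "mistralai/Mistral-7B-v0.1"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(BASE)

# Copy 1: causal-LM backbone with the generation adapter attached
# ("gen-adapter" is a placeholder path).
gen_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(
        BASE, quantization_config=bnb, device_map="auto"
    ),
    "gen-adapter",
)

# Copy 2: sequence-classification backbone for the SEQ_CLS adapter
# ("cls-adapter" is a placeholder; num_labels must match the finetune).
cls_model = PeftModel.from_pretrained(
    AutoModelForSequenceClassification.from_pretrained(
        BASE, num_labels=2, quantization_config=bnb, device_map="auto"
    ),
    "cls-adapter",
)

app = FastAPI()

@app.post("/generate")  # endpoint 1: base model, adapter switched off
def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(gen_model.device)
    with gen_model.disable_adapter():
        out = gen_model.generate(**inputs, max_new_tokens=128)
    return {"text": tokenizer.decode(out[0], skip_special_tokens=True)}

@app.post("/classify")  # endpoint 2: finetuned classifier
def classify(text: str):
    inputs = tokenizer(text, return_tensors="pt").to(cls_model.device)
    with torch.no_grad():
        logits = cls_model(**inputs).logits
    return {"label": int(logits.argmax(dim=-1))}

@app.post("/generate-finetuned")  # endpoint 3: generation adapter active
def generate_finetuned(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(gen_model.device)
    out = gen_model.generate(**inputs, max_new_tokens=128)
    return {"text": tokenizer.decode(out[0], skip_special_tokens=True)}
```

With 4-bit quantization both backbones should fit comfortably on the 24 GB card, but if there's a way to share a single backbone between the two heads I'd prefer that.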