Deployment of finetuned Mistral for Classification and Generation


I have fine-tuned a Mistral model for text classification and want to deploy it. I also want to deploy the Mistral base model for generation. The idea is a FastAPI app with two endpoints: one for text generation with the Mistral base model and one for classification with my fine-tuned version. I want to load the base model only once and switch between the two by turning the adapter on and off. Is that possible? Is there an example out there?



So what you’re asking for is multi-LoRA support? The team is on it: Multi-lora support · Issue #1622 · huggingface/text-generation-inference · GitHub.

TGI (text-generation-inference) is a framework for deploying LLMs and multimodal models. There’s also vLLM, which already supports multi-LoRA: Using LoRA adapters — vLLM.

Both TGI and vLLM offer OpenAI API compatibility, which means that you can call the models in the same way as you would call OpenAI models.
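
To make the OpenAI-compatible part concrete, here is a minimal sketch of how a client could pick the base model or an adapter per request. It assumes a vLLM-style server where, as far as I understand the vLLM docs, a LoRA adapter registered at startup is addressed by its name in the `model` field. The URL `http://localhost:8000` and the names `mistral-base` and `gen-lora` are placeholders, not values from this thread, and the helper `build_completion_request` is hypothetical. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build (but do not send) a POST request for an OpenAI-style completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder names: "mistral-base" stands for the base model,
# "gen-lora" for an adapter registered on the server under that name.
base_req = build_completion_request("http://localhost:8000", "mistral-base", "Hello")
lora_req = build_completion_request("http://localhost:8000", "gen-lora", "Hello")

# Sending would be: urllib.request.urlopen(lora_req), once a server is running.
```

The only difference between the two requests is the `model` value, which is what lets one deployed base model serve several adapters.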

Yes, kind of.
I think I wasn’t clear enough.

I have a fine-tuned AutoModelForSequenceClassification version with the LoRA task type ‘SEQ_CLS’.
I have a second fine-tuned AutoModelForCausalLM version with the LoRA task type ‘CAUSAL_LM’.

I want to create an API with three endpoints:

  1. Mistral base model → generate (no LoRA)
  2. Fine-tuned classification model (LoRA)
  3. Fine-tuned generation model (LoRA)

I want to deploy the base model once and then switch the adapters on and off based on the request. I think combining 1 and 3 is straightforward and works. But I am not sure whether I can also incorporate the classification LoRA here.

(Context: I have an old NVIDIA server card with 24 GB.)

Thank you!

Yes, that should work once multi-LoRA is natively supported in TGI.

See also GitHub - predibase/lorax: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
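
To illustrate why one base model can back all three endpoints, here is a toy sketch of the idea in plain Python. It is not PEFT, TGI, vLLM, or LoRAX code: the class `SharedLinear`, the route names, and the tiny matrices are all made up for illustration. A LoRA adapter adds a low-rank update `B(Ax)` on top of the frozen base output `Wx`, so "turning the adapter off" just means skipping that term. One thing worth checking for the classification case: as far as I know, PEFT saves the classification head together with a SEQ_CLS adapter, so that endpoint needs the head in addition to the LoRA weights, and a generation-oriented server may not expose it.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class SharedLinear:
    """One frozen base weight W plus named low-rank updates (B, A)."""

    def __init__(self, weight):
        self.weight = weight
        self.adapters = {}  # adapter name -> (B, A)

    def add_adapter(self, name, b, a):
        self.adapters[name] = (b, a)

    def forward(self, x, adapter=None):
        y = matvec(self.weight, x)  # base output W x
        if adapter is not None:
            b, a = self.adapters[adapter]
            delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
            y = [yi + di for yi, di in zip(y, delta)]
        return y


# Hypothetical routes: None means "base model, no adapter".
ROUTES = {"/generate": None, "/classify": "cls_lora", "/generate-finetuned": "gen_lora"}

layer = SharedLinear([[1.0, 0.0], [0.0, 1.0]])  # the base weight is stored once
layer.add_adapter("gen_lora", b=[[1.0], [0.0]], a=[[0.0, 1.0]])
layer.add_adapter("cls_lora", b=[[0.0], [1.0]], a=[[1.0, 0.0]])


def handle(route, x):
    """Pick the adapter per request instead of mutating shared state."""
    return layer.forward(x, adapter=ROUTES[route])


print(handle("/generate", [1.0, 2.0]))            # [1.0, 2.0]
print(handle("/generate-finetuned", [1.0, 2.0]))  # [3.0, 2.0]
print(handle("/classify", [1.0, 2.0]))            # [1.0, 3.0]
```

Passing the adapter name with each call, rather than flipping a global "active adapter" switch, keeps concurrent requests from interfering with each other. On a 24 GB card this matters mainly because the base weights are held once, while each adapter adds only its small `B` and `A` matrices.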
