LLM Inference hosting issue

I have a fine-tuned LLM that needs to be deployed on AWS for inference. I have built an API that takes the text of a user's query and replies with the answer text. My question: I need to serve more than one consumer at a time, each using the LLM to generate text. What is the best way to handle this, given that the LLM is hosted on a single AWS instance and I don't want to make that instance elastic?
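
To make the setup concrete, here is a rough sketch of the kind of API I mean. FastAPI, the semaphore limit, and the `generate_answer` stub below are just illustrative placeholders, not my actual code:

```python
# Rough sketch: one model on one instance, with concurrent requests
# limited by a semaphore so the GPU is not oversubscribed.
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Allow only a few generations at a time on the single instance (placeholder value).
MAX_CONCURRENT_GENERATIONS = 2
gpu_slots = asyncio.Semaphore(MAX_CONCURRENT_GENERATIONS)


class Query(BaseModel):
    text: str


def generate_answer(prompt: str) -> str:
    # Placeholder for the real model call (e.g. a transformers pipeline);
    # it just echoes here so the sketch runs end to end.
    return f"stub answer for: {prompt}"


@app.post("/generate")
async def generate(query: Query):
    async with gpu_slots:
        # Run the blocking model call in a worker thread so the event loop
        # can keep accepting requests from other users in the meantime.
        answer = await asyncio.to_thread(generate_answer, query.text)
    return {"answer": answer}
```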

Hi @316usman, this is not a solution reply but a question instead. Can you please tell me how you fine-tuned your LLM?

@coreaiteam Thanks for asking. I used an AWS SageMaker notebook to load the model and my dataset, fine-tuned the model with QLoRA, and then pushed it to my Hugging Face account.
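
Roughly, the notebook looked like the sketch below. The base model id, LoRA hyperparameters, and the Hub repo name are placeholders for illustration, not the values I actually used:

```python
# Sketch of a QLoRA setup: 4-bit base model + small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Load the base model in 4-bit so it fits on a single notebook GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized weights and attach LoRA adapters to train.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# ... run the usual training loop / Trainer on the fine-tuning dataset ...

# Push only the trained adapter weights to the Hub.
model.push_to_hub("my-username/my-finetuned-adapter")  # placeholder repo name
```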

Actually, I have fine-tuned a lot of models for different downstream tasks, such as adding knowledge to the LLM or adjusting its tone for marketing purposes, but those were for a single client's use. Now I am working on deployment through AWS endpoints that my client's whole team would be using at the same time, so I am building APIs so that every user interacts with the model in their own space.
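
By "their own space" I mean roughly per-user conversation state like the sketch below, where one shared model serves everyone but each session keeps its own history. The session handling here is only an illustration of the idea, not my implementation:

```python
# Sketch: one shared model, separate conversation histories keyed by session id.
from collections import defaultdict

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# session_id -> list of (user_text, model_answer) turns
histories: dict[str, list[tuple[str, str]]] = defaultdict(list)


class ChatRequest(BaseModel):
    session_id: str
    text: str


def generate_answer(prompt: str) -> str:
    # Placeholder for the shared model call.
    return f"stub answer for: {prompt[-80:]}"


@app.post("/chat")
def chat(req: ChatRequest):
    # Build the prompt from this user's own history only.
    history = histories[req.session_id]
    prompt = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history)
    prompt += f"User: {req.text}\nAssistant:"
    answer = generate_answer(prompt)
    history.append((req.text, answer))
    return {"answer": answer}
```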