Deploying LLM in Production: Performance Degradation with Multiple Users

elielevy · August 14, 2023, 6:38pm

Hi,

I’m seeking help with an issue we’re experiencing when deploying our Large Language Model (LLM) in production. We’ve set up our model to serve requests via a Flask RESTful API, streaming the results to clients. The model is instantiated once and reused across multiple threads to handle multiple requests.

Everything works great when we test with a single user: the entire generation completes in 12 seconds, GPU utilization is at 100%, and the responses are fast. However, when we introduce multiple users in load testing (just 10 concurrent users), performance degrades significantly: GPU utilization drops to around 40%, and response times increase to over 3 minutes per user.

I’m curious whether this performance degradation is expected behavior when scaling up the number of users, or whether there’s something we’re missing in our deployment setup or code. Even with the 10 concurrent users, GPU memory usage stays under 80% of what is available.

Can anyone offer insights, recommendations, or suggestions on how to troubleshoot this issue further? Are there any known best practices or optimization techniques we can apply to improve the performance of our LLM deployment?

We couldn’t find any reference code or deployment write-up describing best practices for what seems to be a very common deployment scenario.

Any guidance you can provide is greatly appreciated.
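For context, the figures quoted above can be compared against a purely serial baseline. The sketch below assumes (this is an assumption, not something measured in this thread) that the 10 requests are processed strictly one after another, each taking the single-user time of 12 seconds. Under that assumption the slowest user would wait about 2 minutes, so "over 3 min per user" is worse than simply queuing the requests.

```python
# Figures quoted in the post above.
single_user_seconds = 12      # one generation, one user
concurrent_users = 10         # load-test concurrency
observed_seconds = 3 * 60     # "over 3 min per user", taken as a lower bound

# Assumption: requests run strictly one after another,
# each taking as long as the single-user case.
finish_times = [single_user_seconds * (i + 1) for i in range(concurrent_users)]
worst_case_serial = finish_times[-1]
mean_serial = sum(finish_times) / len(finish_times)

print(f"serial worst case: {worst_case_serial} s")
print(f"serial mean:       {mean_serial} s")
print(f"observed vs serial worst case: {observed_seconds / worst_case_serial:.2f}x")
```

If the observed latency exceeds the serial worst case, the threads are likely getting in each other's way rather than merely waiting their turn.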

yuval1929 · August 20, 2023, 7:28am

Hi @elielevy, were you able to find out anything about this, or any best practices?

mallapraveen · September 12, 2023, 8:51am

@elielevy, can you share any info on how you are able to scale or use multiple threads to handle multiple requests?

If the model is loaded once in memory and you are using threading to handle multiple requests, then all threads use the same loaded model, right? How is this possible, since the model is already in use by the first request?

Can you please share your implementation?
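On the question of how several threads can use one loaded model: threads in the same process share memory, so every request handler simply holds a reference to the same object. The sketch below is not the original poster's implementation. `FakeModel` is a hypothetical stand-in for a real model, and the lock is one conservative way to make the threads take turns.

```python
import threading
import time

class FakeModel:
    """Hypothetical stand-in for a model that is loaded once at startup."""
    def generate(self, prompt: str) -> str:
        time.sleep(0.01)          # pretend to do some work
        return prompt.upper()

model = FakeModel()               # one instance, shared by every thread
model_lock = threading.Lock()     # serialises access to the shared instance
results = {}

def handle_request(request_id: int, prompt: str) -> None:
    # Every thread sees the same `model` object; the lock makes them take turns.
    with model_lock:
        results[request_id] = model.generate(prompt)

threads = [
    threading.Thread(target=handle_request, args=(i, f"prompt {i}"))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results), "requests served by a single shared model")
```

Note that a lock only makes sharing orderly. It does not add throughput, because requests are still served one at a time.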

kaoutaar · November 5, 2023, 4:41pm

need to know the answer too.

sorasora · November 16, 2023, 8:02am

need to know the answer too.

abhishek-wfx · November 21, 2023, 10:26am

Concurrent requests won't work. Use FastAPI and read about batching.
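As a rough illustration of what batching means here, the following is a minimal sketch using only `asyncio` from the standard library. It is not FastAPI code and not a definitive implementation: `fake_generate_batch`, `MAX_BATCH_SIZE`, and `MAX_WAIT_SECONDS` are hypothetical names and values chosen for the example. The idea is that request handlers put prompts on a queue, and a single worker groups whatever arrives within a short window into one batched model call.

```python
import asyncio

MAX_BATCH_SIZE = 8        # hypothetical tuning values
MAX_WAIT_SECONDS = 0.02

batch_sizes = []          # recorded only so the batching can be inspected

def fake_generate_batch(prompts):
    """Hypothetical stand-in for one batched forward pass of the model."""
    return [p.upper() for p in prompts]

async def batch_worker(queue):
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one request is waiting.
        batch = [await queue.get()]
        deadline = loop.time() + MAX_WAIT_SECONDS
        # Collect more requests until the batch is full or the window closes.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        batch_sizes.append(len(batch))
        outputs = fake_generate_batch([prompt for prompt, _ in batch])
        # Hand each caller its own result.
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)

async def submit(queue, prompt):
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    answers = await asyncio.gather(*(submit(queue, f"prompt {i}") for i in range(10)))
    worker.cancel()
    return answers

answers = asyncio.run(main())
print(answers[:2], batch_sizes)
```

In a real service the batched model call would block, so it would probably need to run off the event loop (for example via `loop.run_in_executor`). Dedicated inference servers implement more sophisticated versions of this idea.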

aidev24 · June 7, 2024, 10:40am

Did you find a solution for this?
