I’m seeking help with an issue we’re experiencing when deploying our Large Language Model (LLM) in production. We’ve set up our model to serve requests through a Flask RESTful API, streaming the results to clients. The model is instantiated once and reused across multiple threads to handle concurrent requests.
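To make the setup concrete, here is a minimal, dependency-free sketch of its shape. It is not our actual code: `StubModel`, `generate_stream`, `handle_request`, and `run_load_test` are hypothetical names, the worker threads stand in for Flask request handlers, and the stub only sleeps instead of running a GPU, so it shows the structure and how we measure latency but will not reproduce the slowdown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class StubModel:
    """Hypothetical stand-in for the real LLM (assumption: not our actual model)."""

    def __init__(self, tokens_per_reply=5, seconds_per_token=0.01):
        self.tokens_per_reply = tokens_per_reply
        self.seconds_per_token = seconds_per_token

    def generate_stream(self, prompt):
        # Yield tokens one at a time, the way a streaming endpoint would.
        for i in range(self.tokens_per_reply):
            time.sleep(self.seconds_per_token)  # placeholder for one decoding step
            yield f"{prompt}-tok{i}"


# Instantiated once at startup and shared by every request thread.
MODEL = StubModel()


def handle_request(prompt):
    """Stands in for a Flask view: consume the stream and time the full reply."""
    start = time.perf_counter()
    tokens = list(MODEL.generate_stream(prompt))
    return tokens, time.perf_counter() - start


def run_load_test(num_users):
    """Fire num_users requests at once, one thread per simulated user."""
    prompts = [f"user{i}" for i in range(num_users)]
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        return list(pool.map(handle_request, prompts))


if __name__ == "__main__":
    for n in (1, 10):
        results = run_load_test(n)
        worst = max(latency for _, latency in results)
        print(f"{n} concurrent user(s): worst latency {worst:.3f}s")
```

In the real service, `generate_stream` would wrap the model's generation call and `handle_request` would return a streamed HTTP response instead of a list.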
Everything works well when we test with a single user: the entire generation completes in 12 seconds, GPU utilization is at 100%, and responses are fast. However, when we introduce multiple users in load testing (just 10 concurrent users), performance degrades significantly. GPU utilization drops to around 40%, and response times increase to over 3 minutes per user.
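To put those numbers in perspective, here is a quick back-of-the-envelope check. It is only a rough sketch, assuming every request costs the same 12 seconds as in the single-user test and treating 3 minutes as a lower bound on the observed latency; the variable names are just for illustration.

```python
# Figures from the load test described above.
single_user_seconds = 12
concurrent_users = 10
observed_seconds = 3 * 60  # "over 3 minutes", so this is a lower bound

# If the requests were simply handled one after another, the last user
# would wait for all ten generations to finish.
serialized_worst_case = single_user_seconds * concurrent_users

# How much slower each user is compared with the single-user run.
slowdown = observed_seconds / single_user_seconds

print(f"Fully serialized worst case: {serialized_worst_case} s")  # 120 s
print(f"Observed latency (at least): {observed_seconds} s")       # 180 s
print(f"Slowdown vs. single user: {slowdown:.0f}x or more")       # 15x or more
```

Under these assumptions, each user waits longer than they would if the ten requests had been processed strictly one at a time, which is why we suspect something in the setup rather than plain GPU saturation.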
I’m curious whether this performance degradation is expected behavior when scaling up the number of users, or whether there’s something we’re missing in our deployment setup or code. Even with 10 concurrent users, GPU memory usage stays under 80% of what is available.
Can anyone offer insights, recommendations, or suggestions on how to troubleshoot this issue further? Are there any known best practices or optimization techniques we can apply to improve the performance of our LLM deployment?
We couldn’t find any reference code or deployment write-up describing best practices for what seems to be a very common deployment scenario.
Any guidance you can provide is greatly appreciated.