Hugging Face Forums
Deploying LLM in Production: Performance Degradation with Multiple Users
🤗Transformers
aidev24
June 7, 2024, 10:40am
Did you find a solution for this?