Simultaneous processing of multiple queries with an LLM

I’m trying to keep an LLM loaded in memory and process requests without a queue. I want the model to handle several requests concurrently, up to 10 requests per second. I realize this requires a very powerful server, but the main question is: is it possible at all? Do any models support multithreading? Are there open-source models, like Mistral from OpenOrca, that can process multiple requests simultaneously? I’d be grateful for any advice!
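For what it’s worth, here is a minimal sketch of one way to do this with vLLM, a serving library that keeps the model loaded and uses continuous batching so one model instance can serve many prompts concurrently. The model name and sampling parameters below are assumptions for illustration, not a recommendation:

```python
# Minimal sketch: batched inference with vLLM (assumed installed, e.g. `pip install vllm`).
# vLLM keeps the weights in GPU memory and schedules many requests together
# via continuous batching, so no external queue is needed.
from vllm import LLM, SamplingParams

# Model name is an assumption for illustration; any HF-hosted model vLLM supports works.
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the benefits of batched inference.",
    "Explain continuous batching in one sentence.",
]

# generate() accepts a list of prompts and schedules them together.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server, which may be a better fit than the in-process API if your 10 requests/second arrive over the network.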

Multithreading is definitely possible. I’m currently working on multithreading my RAG pipeline to reduce response time. For now, no framework (LangChain, LlamaIndex, etc.) supports it natively within its structure, but it’s definitely possible through external libraries. As for models, I don’t think the model docs say anything explicit about multithreading, so it’s trial and error.
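As an illustration of the "external libraries" route, here is a minimal sketch using Python’s standard `concurrent.futures` to fan out several queries against one loaded model. `answer_query` is a hypothetical stand-in for whatever inference call your pipeline makes, and the worker count is an assumption to tune:

```python
# Minimal sketch: fan out queries with a thread pool from the standard library.
# answer_query is a hypothetical placeholder for your pipeline's inference call;
# threads help when that call releases the GIL (e.g. GPU inference or an HTTP
# request to a model server). For CPU-bound inference, prefer batching instead.
from concurrent.futures import ThreadPoolExecutor

def answer_query(query: str) -> str:
    # Placeholder: call your loaded model / RAG pipeline here.
    return f"answer for: {query}"

queries = ["What is RAG?", "How does batching work?", "Why use a thread pool?"]

# max_workers=10 matches the ~10 requests/second target; tune for your hardware.
with ThreadPoolExecutor(max_workers=10) as pool:
    for result in pool.map(answer_query, queries):
        print(result)
```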