Simultaneous processing of multiple queries with an LLM

I’m trying to keep an LLM loaded in memory and process requests without a queue. I want the model to handle several requests concurrently, up to 10 requests per second. I realize this requires a very powerful server, but the main question is: is it possible at all? Do any models support multithreading? Are there open-source models, like Mistral from OpenOrca, that can process multiple requests simultaneously? I’d be grateful for any advice!
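For what it’s worth, here is a minimal sketch of one way to do this with vLLM, a serving library that keeps the model loaded and uses continuous batching so one model instance can serve many prompts concurrently. The model name and sampling parameters below are assumptions for illustration, not a recommendation:

```python
# Minimal sketch: batched inference with vLLM (assumed installed, e.g. `pip install vllm`).
# vLLM keeps the weights in GPU memory and schedules many requests together
# via continuous batching, so no external queue is needed.
from vllm import LLM, SamplingParams

# Model name is an assumption for illustration; any HF-hosted model vLLM supports works.
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the benefits of batched inference.",
    "Explain continuous batching in one sentence.",
]

# generate() accepts a list of prompts and schedules them together.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server, which may be a better fit than the in-process API if your 10 requests/second arrive over the network.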

Multithreading is definitely possible. I’m currently working on multithreading my RAG pipeline to reduce response time. For now, no framework (LangChain, LlamaIndex, etc.) supports it natively within its structure, but it’s definitely possible through external libraries. As for models, I don’t think the model docs say anything explicit about multithreading, so it’s trial and error.
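As an illustration of the "external libraries" route, here is a minimal sketch using Python’s standard `concurrent.futures` to fan out several queries against one loaded model. `answer_query` is a hypothetical stand-in for whatever inference call your pipeline makes, and the worker count is an assumption to tune:

```python
# Minimal sketch: fan out queries with a thread pool from the standard library.
# answer_query is a hypothetical placeholder for your pipeline's inference call;
# threads help when that call releases the GIL (e.g. GPU inference or an HTTP
# request to a model server). For CPU-bound inference, prefer batching instead.
from concurrent.futures import ThreadPoolExecutor

def answer_query(query: str) -> str:
    # Placeholder: call your loaded model / RAG pipeline here.
    return f"answer for: {query}"

queries = ["What is RAG?", "How does batching work?", "Why use a thread pool?"]

# max_workers=10 matches the ~10 requests/second target; tune for your hardware.
with ThreadPoolExecutor(max_workers=10) as pool:
    for result in pool.map(answer_query, queries):
        print(result)
```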