My question might be a bit basic, but I’m new to all of this and eager to learn.
I have a basic setup where I initialize an LLM using vLLM with LangChain RAG and the Llama model (specifically, llama2-13b-chat-hf). Here's what I do:
- I define a system prompt and an instruction.
- I create an `llm_chain`.
- I then run the chain with `llm_chain.run(text)`, which works for a single input. A minimal sketch of this setup is below.
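
To make this concrete, here is roughly what the setup looks like. The actual system prompt, template, and sampling parameters are placeholders, and I'm using LangChain's `VLLM` wrapper with an `LLMChain`:

```python
# Rough sketch of my setup -- the prompt text and sampling parameters
# below are placeholders, not the real ones.
from langchain.llms import VLLM
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load Llama 2 13B chat through the vLLM backend
llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=512,
    temperature=0.1,
)

# System prompt + instruction wrapped in the Llama 2 chat format
system_prompt = "You are a helpful assistant."  # placeholder
template = (
    "<s>[INST] <<SYS>>\n" + system_prompt + "\n<</SYS>>\n\n{text} [/INST]"
)
prompt = PromptTemplate(template=template, input_variables=["text"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

# This works fine for a single input
text = "What is retrieval-augmented generation?"
print(llm_chain.run(text))
```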
I have built an app with FastAPI. Previously I used asyncio to handle multiple requests to the LLM, but responses got slower with each new request, so I decided to switch to vLLM. My problem now is how to serve parallel or concurrent requests with vLLM when dealing with a dozen or more users. Is there a way to call `run` in parallel for several inputs and receive valid results for each input?
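
For example, I imagine collecting the pending inputs and passing them to the chain in one call so vLLM can batch them internally, rather than calling `run` once per request. Something like this sketch (I'm not sure whether `apply` is actually the right method for this):

```python
# Sketch of what I'd like to achieve: feed several user inputs through the
# same chain at once and get one valid result back per input.
inputs = [
    {"text": "first user question"},
    {"text": "second user question"},
    {"text": "third user question"},
]

# LLMChain.apply runs the chain over a list of inputs and returns one
# output dict per input (keyed by "text" for a default LLMChain).
results = llm_chain.apply(inputs)
for result in results:
    print(result["text"])
```

Would something like this be the right way to serve a dozen or more concurrent users from the FastAPI app, or is there a better pattern?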