Parallel/concurrent requests with vLLM

My question might be a bit basic, but I’m new to all of this and eager to learn.

I have a basic setup where I initialize an LLM using vLLM together with LangChain RAG and the Llama 2 model (specifically, llama2-13b-chat-hf). Here’s what I do:

  • I define a system prompt and an instruction
  • I create an llm_chain
  • I then run the chain with llm_chain.run(text), which works for a single input (a rough sketch of this setup is shown below).
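
Concretely, the setup looks roughly like this (a minimal sketch only: it assumes the langchain_community VLLM wrapper, and the prompt text, model id, and sampling parameters are illustrative placeholders):

```python
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import VLLM

# Load the model through vLLM via LangChain's wrapper (placeholder parameters)
llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=256,
    temperature=0.1,
)

# System prompt + instruction wrapped in a Llama-2 chat template (illustrative)
template = (
    "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    "Answer the question using the provided context.\n\n{text} [/INST]"
)
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(llm=llm, prompt=prompt)

# Works fine for a single input
print(llm_chain.run("What does vLLM do?"))
```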

I have built an app with FastAPI. Previously I used asyncio to handle multiple requests to the LLM, but responses got slower with each new request. So I decided to switch to vLLM, but now my problem is how to serve parallel or concurrent requests to vLLM when dealing with a dozen or more users. Is there a way to call run in parallel for several inputs and receive valid results for each input?
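
One pattern that seems to fit this (sketch below, not verified against the exact setup above) is to share a single vLLM engine per process and let its continuous batching interleave the concurrent requests, exposing it through an async FastAPI route instead of calling llm_chain.run per user. The sketch assumes vLLM's AsyncLLMEngine with its engine.generate(prompt, sampling_params, request_id) async-generator API; the /generate route, Query model, and sampling parameters are invented for illustration:

```python
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()

# One shared engine per process; vLLM's continuous batching interleaves
# whatever requests are in flight on the GPU.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-2-13b-chat-hf")
)

class Query(BaseModel):
    text: str

@app.post("/generate")
async def generate(query: Query):
    params = SamplingParams(max_tokens=256, temperature=0.1)
    request_id = str(uuid.uuid4())  # each in-flight request needs a unique id

    # engine.generate yields partial RequestOutputs as tokens are produced;
    # iterate to the end to get the final completion.
    final_output = None
    async for output in engine.generate(query.text, params, request_id):
        final_output = output

    return {"result": final_output.outputs[0].text}
```

Alternatively, vLLM ships an OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model ...) that you can call from FastAPI with an async HTTP client. And if you only need to process a batch of inputs in one go rather than serve live traffic, the offline LLM.generate accepts a list of prompts; at the chain level, llm_chain.apply([{"text": t} for t in texts]) should forward the whole list in a single call.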


Hi,
Have you found a solution for this?

Hi, have you found a solution for this? I am also facing the same issue.

Hello, unfortunately no, so I decided to move on without vLLM.