My question might be a bit basic, but I’m new to all of this and eager to learn.
I have a basic setup where I initialize an LLM using vLLM with LangChain RAG and the Llama model (specifically, llama2-13b-chat-hf). Here's what I do:
- I define a system prompt and an instruction.
- I create an `llm_chain`.
- I then run the chain with `llm_chain.run(text)`, which works for a single input. A minimal sketch of this setup is below.
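
To make this concrete, here is roughly what the setup looks like. The actual system prompt, template, and sampling parameters are placeholders, and I'm using LangChain's `VLLM` wrapper with an `LLMChain`:

```python
# Rough sketch of my setup -- the prompt text and sampling parameters
# below are placeholders, not the real ones.
from langchain.llms import VLLM
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load Llama 2 13B chat through the vLLM backend
llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=512,
    temperature=0.1,
)

# System prompt + instruction wrapped in the Llama 2 chat format
system_prompt = "You are a helpful assistant."  # placeholder
template = (
    "<s>[INST] <<SYS>>\n" + system_prompt + "\n<</SYS>>\n\n{text} [/INST]"
)
prompt = PromptTemplate(template=template, input_variables=["text"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

# This works fine for a single input
text = "What is retrieval-augmented generation?"
print(llm_chain.run(text))
```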
I have built an app with FastAPI. Previously I used asyncio to handle multiple requests to the LLM, but responses got slower with each new request, so I decided to switch to vLLM. My problem now is how to serve parallel or concurrent requests with vLLM when dealing with a dozen or more users. Is there a way to call `run` in parallel for several inputs and receive valid results for each input?
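
For example, I imagine collecting the pending inputs and passing them to the chain in one call so vLLM can batch them internally, rather than calling `run` once per request. Something like this sketch (I'm not sure whether `apply` is actually the right method for this):

```python
# Sketch of what I'd like to achieve: feed several user inputs through the
# same chain at once and get one valid result back per input.
inputs = [
    {"text": "first user question"},
    {"text": "second user question"},
    {"text": "third user question"},
]

# LLMChain.apply runs the chain over a list of inputs and returns one
# output dict per input (keyed by "text" for a default LLMChain).
results = llm_chain.apply(inputs)
for result in results:
    print(result["text"])
```

Would something like this be the right way to serve a dozen or more concurrent users from the FastAPI app, or is there a better pattern?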