Host a Model with vLLM for RAG

Hello! I am struggling with an issue trying to host an LLM with vLLM for RAG. When I start the vLLM server with minimal arguments, for example vllm serve --model NousResearch/Hermes-2-Pro-Llama-3-8B, the server starts, but when I send a request I get an error telling me that I should define a chat template.

If I instead start the server with more arguments based on the vLLM documentation: vllm serve --model mistralai/Mistral-7B-Instruct-v0.3 --chat-template examples/tool_chat_template_mistral.jinja --enable-auto-tool-choice --tool-call-parser mistral --gpu-memory-utilization=0.5, I get the error: vllm: error: unrecognized arguments: --enable-auto-tool-choice --tool-call-parser. Can anyone provide any help or advice? Thanks in advance!
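For reference, this is roughly the request I am sending when I hit the chat-template error (a minimal sketch, assuming the default OpenAI-compatible endpoint on port 8000 and the openai Python client; my actual payload is equivalent):

```python
# Minimal sketch of the request that triggers the "define a template" error.
# Assumes the vLLM OpenAI-compatible server is running locally on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",
    messages=[
        {"role": "system", "content": "Answer the question using the retrieved context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```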

I’m not sure vLLM supports the new tool formats, cc @Rocketknight1. They recently merged a PR to support tool calling: [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models by K-Mistele · Pull Request #5649 · vllm-project/vllm · GitHub, but I’m not sure it’s based on the chat_template attribute of the tokenizer_config.json.

Tool use recently got unified (it is now handled via the chat_template attribute): see the Tool Use, Unified blog post.
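In practice that means the tool definitions are rendered by the model's own chat template. A rough sketch with transformers (assuming a recent transformers version whose apply_chat_template accepts a tools argument, and a model whose chat template defines tool handling; I believe Hermes-2-Pro ships one):

```python
# Sketch: rendering a tool-calling prompt through the model's own chat template.
# Requires a transformers version where apply_chat_template accepts `tools`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

def get_current_weather(location: str) -> str:
    """
    Get the current weather for a location.

    Args:
        location: The city to get the weather for.
    """
    return "sunny"

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],  # the tool schema is extracted from the signature/docstring
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # the tool definitions are embedded by the chat template itself
```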

Thanks a lot for your answer! Do you happen to know why, when using Ollama, it's possible to run larger models like Mixtral 8x7B on a GPU with 20GB (Ollama seems to manage the model's memory well), but when I try to load the same larger models with vLLM, the GPU can't handle them and I get a memory-allocation error? How is it possible to host the model with Ollama but not with vLLM? Is there a best practice for configuring the engine arguments when starting the vLLM server so that the model fits in the GPU? I tried setting a lower value, --gpu_memory_utilization=0.3, but it still doesn't work.

The reason it works with Ollama is that Ollama is built on top of llama.cpp, which works with so-called quantized models. Quantization is a technique that significantly reduces the size of a model while keeping the performance degradation minimal. Ollama, like llama.cpp, works with models in the GGUF format. If you check the mixtral model page, you'll see that it uses Q4_0 quantization by default, shrinking Mixtral down to just 26GB (whereas the model is 96GB in half precision).
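You can sanity-check those numbers with a back-of-the-envelope estimate (a rough sketch; the ~46.7B total parameter count for Mixtral 8x7B and the ~4.5 bits per weight for Q4_0 are approximations, and real usage adds overhead for the KV cache and activations):

```python
# Back-of-the-envelope weight-memory estimate for Mixtral 8x7B.
# Approximate figures: ~46.7B total parameters; Q4_0 stores roughly
# 4.5 bits per weight once the per-block scales are included.
params = 46.7e9

bf16_gb = params * 16 / 8 / 1e9   # 2 bytes per parameter
q4_0_gb = params * 4.5 / 8 / 1e9  # ~4.5 bits per parameter

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")  # ~93 GB, in the ballpark of the ~96 GB above
print(f"Q4_0 weights:     ~{q4_0_gb:.0f} GB")  # ~26 GB, matching the GGUF download size
```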

vLLM, on the other hand, works with safetensors or binary PyTorch files. By default, it loads the model in its native precision; for example, mistralai/Mixtral-8x7B-Instruct-v0.1 · Hugging Face uses bfloat16 (16 bits, or 2 bytes per parameter), hence you'll need about 96GB. vLLM also supports various quantization algorithms (see Supported Hardware for Quantization Kernels — vLLM), but you need to enable them with additional flags.
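For example, with the offline Python API you could point vLLM at a pre-quantized checkpoint and set the quantization engine argument (a sketch; the AWQ repo name is just an example of a community quant, and flag names can differ slightly between vLLM versions):

```python
# Sketch: loading a pre-quantized (AWQ) Mixtral checkpoint with vLLM's offline API.
# The same options exist as engine arguments on the OpenAI-compatible server.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # example community AWQ quant
    quantization="awq",            # tell vLLM which quantization kernels to use
    dtype="half",                  # AWQ kernels expect fp16 activations
    gpu_memory_utilization=0.9,    # fraction of VRAM vLLM is allowed to claim
    max_model_len=8192,            # cap the context length to keep the KV cache small
)

outputs = llm.generate(
    ["Explain retrieval-augmented generation in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```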


You are the best! You've really cleared things up for me. I didn't even realize that Ollama uses quantized models as well; I thought that was only the case with llama.cpp. That's why I turned to vLLM: Mixtral was giving me really good answers for RAG with Ollama, and I thought that if I used a quantized model with llama.cpp, maybe I wouldn't get the same quality of answers. But in reality, when I was using Ollama, I was also using a quantized model without realizing it… right? Now that I've started the Ollama server, I can see exactly the parameter you mentioned: with Mistral 7B there is a log line that says llm_load_print_meta: model ftype = Q4_0.

When you say to enable the quantization algorithms in vLLM by using additional flags, you mean setting the engine arguments when starting the vLLM server, right? I really appreciate your answer; your explanation was excellent! Thanks again, really!

It seems that in vLLM you can additionally pass the --quantization flag; see OpenAI Compatible Server — vLLM.
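Roughly, it could look like this (a sketch; the quantized repo id and the exact flag spelling are assumptions, so double-check them against the docs for your vLLM version):

```python
# Assumed server launch (shell), serving a pre-quantized AWQ checkpoint:
#
#   vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
#       --quantization awq \
#       --gpu-memory-utilization 0.9 \
#       --max-model-len 8192
#
# Quick check from Python that the server is up and sees the model:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)
```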

Thanks again for your answer. It's clear you really know your way around open-source models and tools like Ollama. I was wondering whether you have any experience or tips for optimizing RAG (Retrieval-Augmented Generation) for production. Specifically, if you have access to resources like 2 GPUs to host an LLM locally, do you think it's realistic to build a production-ready chatbot using RAG with frameworks like Ollama, llama.cpp, vLLM, or any others?

Also, do you think it's feasible to handle multiple requests with real-time responses, whether for a single domain or even across different domains, using these tools?

Would love to hear your thoughts!