The reason it works with Ollama is that Ollama is built on top of llama.cpp, which works with so-called quantized models. Quantization is a technique to significantly reduce the size of a model while keeping the performance degradation minimal. Ollama, like llama.cpp, works with models in the GGUF format. If you check the mixtral model on Ollama, you'll see that it uses Q4_0 quantization by default, which shrinks Mixtral down to just 26 GB (whereas the model is about 96 GB in half precision).
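To see where those numbers come from, here's a rough back-of-the-envelope calculation (a sketch only; the parameter count and the effective bits per weight for Q4_0 are approximations):

```python
# Rough weight-memory footprint of Mixtral-8x7B at different precisions.
# Parameter count (~46.7B) and effective bits per weight are approximate.
PARAMS = 46.7e9

def size_gb(bits_per_param: float) -> float:
    """Approximate size of the weights in GB for a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"bfloat16 (16 bits/param): ~{size_gb(16):.0f} GB")   # ~93 GB, close to the 96 GB figure
print(f"Q4_0    (~4.5 bits/param): ~{size_gb(4.5):.0f} GB")  # ~26 GB, matching the Ollama download
```

(Q4_0 stores 4-bit weights plus a scale per block, so it works out to roughly 4.5 bits per parameter.)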
vLLM, on the other hand, works with safetensors or binary PyTorch files. By default, it loads the model in its native precision, e.g. mistralai/Mixtral-8x7B-Instruct-v0.1 on Hugging Face uses bfloat16 (16 bits, or 2 bytes per parameter) as its default precision, hence the 96 GB requirement. vLLM supports various quantization algorithms as well (see the "Supported Hardware for Quantization Kernels" page in the vLLM docs), but you need to enable them with additional flags.
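For example, here's a minimal sketch of loading an AWQ-quantized Mixtral checkpoint through vLLM's Python API. The model repo name is just an example of a community AWQ export, and which quantization methods are available depends on your vLLM version and GPU:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint instead of the full bfloat16 weights.
# The repo below is an example AWQ export; check the vLLM docs for the
# quantization methods supported on your hardware.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
)

outputs = llm.generate(
    ["Explain quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Note that you need a checkpoint that was already quantized with the corresponding method; the flag tells vLLM how to interpret the weights, it doesn't quantize a full-precision model on the fly.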