The reason it works with Ollama is that Ollama is built on top of llama.cpp, which works with so-called quantized models. Quantization is a technique to significantly reduce the size of a model while keeping the performance degradation minimal. Ollama, like llama.cpp, works with models in the GGUF format. If you check the mixtral model on Ollama, you'll see that it uses Q4_0 quantization by default, which shrinks Mixtral down to just 26 GB (whereas the model is about 96 GB in half precision).
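To see where those numbers come from, here's a rough back-of-the-envelope calculation (a sketch only; the parameter count and the effective bits per weight for Q4_0 are approximations):

```python
# Rough weight-memory footprint of Mixtral-8x7B at different precisions.
# Parameter count (~46.7B) and effective bits per weight are approximate.
PARAMS = 46.7e9

def size_gb(bits_per_param: float) -> float:
    """Approximate size of the weights in GB for a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"bfloat16 (16 bits/param): ~{size_gb(16):.0f} GB")   # ~93 GB, close to the 96 GB figure
print(f"Q4_0    (~4.5 bits/param): ~{size_gb(4.5):.0f} GB")  # ~26 GB, matching the Ollama download
```

(Q4_0 stores 4-bit weights plus a scale per block, so it works out to roughly 4.5 bits per parameter.)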
vLLM, on the other hand, works with safetensors or binary PyTorch files. By default, it loads the model in its native precision, e.g. mistralai/Mixtral-8x7B-Instruct-v0.1 on Hugging Face uses bfloat16 (16 bits, or 2 bytes per parameter) as its default precision, hence the 96 GB requirement. vLLM supports various quantization algorithms as well (see the "Supported Hardware for Quantization Kernels" page in the vLLM docs), but you need to enable them with additional flags.
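For example, here's a minimal sketch of loading an AWQ-quantized Mixtral checkpoint through vLLM's Python API. The model repo name is just an example of a community AWQ export, and which quantization methods are available depends on your vLLM version and GPU:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint instead of the full bfloat16 weights.
# The repo below is an example AWQ export; check the vLLM docs for the
# quantization methods supported on your hardware.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
)

outputs = llm.generate(
    ["Explain quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Note that you need a checkpoint that was already quantized with the corresponding method; the flag tells vLLM how to interpret the weights, it doesn't quantize a full-precision model on the fly.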