Hi, I just need your opinions on something.
I have a chatbot system that runs using the models below:
- Llama 3.2 90B
- Llama 3.2 3B
- Whisper (large)
- nomic-embed-text (Embedding)
What GPU cluster requirements do I need to run the models above?
Apart from Llama 90B, the other models are small (together they probably don't add up to 10GB), so the requirements for Llama 90B dominate. In 4-bit quantization it needs at least 64GB of VRAM; if you want to run it in fp16, you'll need 256GB. On top of loading the weights, inference itself uses some extra VRAM (KV cache and activations), so it's best to have a bit more headroom than the model size alone.
There is a big cost difference between 64GB and 256GB, so it's better to run it quantized (GGUF or NF4) if at all possible. If you're using Ollama or llama.cpp as the server, you just need to use a Q4_K_M-format GGUF.
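If it helps, here is a rough back-of-envelope sketch of where those numbers come from. The bits-per-weight for Q4_K_M (~4.5) and the flat 20% overhead for KV cache and activations are assumptions, not exact figures for any specific runtime:

```python
# Back-of-envelope VRAM estimate for serving an LLM.
# The bits-per-weight and overhead figures are rough assumptions.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Return an approximate VRAM requirement in GB."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb * (1 + overhead_fraction)

for label, bits in [("Q4_K_M (~4.5 bpw)", 4.5), ("fp16 (16 bpw)", 16.0)]:
    print(f"Llama 90B, {label}: ~{estimate_vram_gb(90, bits):.0f} GB")

# Approximate output:
#   Llama 90B, Q4_K_M (~4.5 bpw): ~61 GB
#   Llama 90B, fp16 (16 bpw): ~216 GB
```

Those rough numbers line up with the 64GB / 256GB ballpark above, plus a little on top for the smaller models.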
Hi @John6666
Thank you for your response!
I am currently using Ollama to run Llama models. Would a cluster of four NVIDIA RTX A6000 GPUs be sufficient to handle the model?
Thanks!
That's 192GB of VRAM (4 × 48GB)!
Ollama's default quantization is Q4_K_M (if you don't specify otherwise, it uses this, which is unlikely to cause problems), so four A6000s should be more than enough. The model itself consumes a little over 64GB, plus some extra for inference. If a lot of people use it at the same time, or you process very long contexts, VRAM usage goes up, but you're still unlikely to run into trouble.
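For example, a minimal sketch of a request against a local Ollama server, capping the context length per request so the KV cache doesn't balloon VRAM usage. The model tag here is an assumption; use whatever `ollama list` shows on your machine:

```python
import requests

# Minimal sketch: call a local Ollama server (default port 11434) and
# limit the context window per request to keep KV-cache VRAM in check.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2-vision:90b",   # assumed tag; check `ollama list`
        "prompt": "Summarize our return policy in two sentences.",
        "stream": False,
        "options": {"num_ctx": 4096},     # smaller context -> smaller KV cache
    },
    timeout=600,
)
print(resp.json()["response"])
```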
By the way, Ollama is fast enough, but llama.cpp seems to be even faster. If you run into speed problems, you might want to try switching.
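If you do try llama.cpp, its llama-server binary exposes an OpenAI-compatible chat endpoint, so the client side barely changes. A minimal sketch, assuming the server was started with your Q4_K_M GGUF on the default port 8080:

```python
import requests

# Minimal sketch: query llama.cpp's llama-server via its
# OpenAI-compatible /v1/chat/completions endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-3.2-90b-q4_k_m",  # assumed name; the server serves whatever GGUF it loaded
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```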