I have a server with 250 GB of RAM, but when I try to run the Llama-3.3-70B quantized models, they fail to load due to RAM limitations. Specifically, I’ve tried running the following models from Unsloth:
- Llama-3.3-70B-Instruct-Q5_K_M.gguf
- Llama-3.3-70B-Instruct-Q3_K_M.gguf
Both fail during the loading process.
Could anyone let me know the RAM requirements to run the Llama-3.3-70B-Instruct-Q3_K_M.gguf model without a GPU?
Generally speaking, if you have RAM roughly equal to the GGUF file size, plus a little extra headroom for the KV cache and runtime buffers during inference, the model can usually be loaded. So with 250 GB you should have more than enough RAM for either quant.
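As a rough sanity check (a sketch only; the bits-per-weight figures below are approximate averages for these quant types, not exact file sizes):

```python
# Back-of-envelope RAM estimate for a 70B-parameter GGUF model.
# The bits-per-weight values are approximate averages for these quant types.
params = 70e9

for quant, bits_per_weight in [("Q5_K_M", 5.5), ("Q3_K_M", 3.9)]:
    size_gb = params * bits_per_weight / 8 / 1e9
    print(f"{quant}: ~{size_gb:.0f} GB of weights")
# Q5_K_M: ~48 GB, Q3_K_M: ~34 GB -- both fit comfortably in 250 GB of RAM.
```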
If there is a problem, it is more likely the VRAM size. If your runtime is configured to offload everything to the GPU, you would need that much VRAM, not RAM, which would be quite difficult. Check that the loader is set to run on the CPU so the weights stay in system RAM.
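For example, with llama-cpp-python (one common way to run GGUF files; the model path and parameter values here are just illustrative assumptions), forcing a pure CPU load looks something like this:

```python
from llama_cpp import Llama

# n_gpu_layers=0 keeps every layer in system RAM (pure CPU inference).
# By default the file is memory-mapped rather than copied, which also
# reduces peak memory during loading.
llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q3_K_M.gguf",  # hypothetical path
    n_gpu_layers=0,   # do not offload any layers to the GPU
    n_ctx=4096,       # context size; larger values need more RAM for the KV cache
    n_threads=16,     # adjust to your CPU core count
)

out = llm("What is the capital of France?", max_tokens=32)
print(out["choices"][0]["text"])
```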
There may be several reasons:
- Activation memory: when running inference, especially for large models, you need additional memory beyond the weights to store activations and the KV cache. This can be a significant portion of your memory usage, especially with a large context or when running a batch of multiple inputs at once.
- Loading overhead: the loading process itself may temporarily require more memory than the model weights alone, depending on how the framework reads and processes the model data (for example, whether it memory-maps the file or copies it).
- System configuration / process limits: ensure that the memory allocation limits of the operating system or the model-loading framework are not being hit. Some systems cap the memory a single process may use, so loading can fail even when the machine has plenty of free RAM; the sketch after this list shows one way to check.
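On Linux, one quick way to inspect the per-process limits is Python’s standard `resource` module (Unix-only; a minimal sketch):

```python
import resource

def show_limit(name, limit):
    # getrlimit returns (soft, hard) limits in bytes for these resources.
    soft, hard = resource.getrlimit(limit)
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else f"{v / 1e9:.1f} GB"
    print(f"{name}: soft={fmt(soft)}, hard={fmt(hard)}")

# RLIMIT_AS caps the total virtual address space; RLIMIT_DATA caps the data segment.
# If either soft limit is well below the model size, loading can fail even
# though the machine has plenty of physical RAM.
show_limit("RLIMIT_AS", resource.RLIMIT_AS)
show_limit("RLIMIT_DATA", resource.RLIMIT_DATA)
```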