I have a server with 250 GB of RAM, but when I try to run the Llama-3.3-70B quantized models, they fail to load due to RAM limitations. Specifically, I’ve tried running the following models from Unsloth:
- Llama-3.3-70B-Instruct-Q5_K_M.gguf
- Llama-3.3-70B-Instruct-Q3_K_M.gguf
Both fail during the loading process.
Could anyone let me know the RAM requirements to run the Llama-3.3-70B-Instruct-Q3_K_M.gguf model without a GPU?
Generally speaking, if you have RAM roughly equal to the GGUF file size, plus a little extra headroom for the KV cache and runtime buffers during inference, the model can usually be loaded. So with 250 GB you should have more than enough RAM for either quant.
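As a rough sanity check (a sketch only; the bits-per-weight figures below are approximate averages for these quant types, not exact file sizes):

```python
# Back-of-envelope RAM estimate for a 70B-parameter GGUF model.
# The bits-per-weight values are approximate averages for these quant types.
params = 70e9

for quant, bits_per_weight in [("Q5_K_M", 5.5), ("Q3_K_M", 3.9)]:
    size_gb = params * bits_per_weight / 8 / 1e9
    print(f"{quant}: ~{size_gb:.0f} GB of weights")
# Q5_K_M: ~48 GB, Q3_K_M: ~34 GB -- both fit comfortably in 250 GB of RAM.
```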
If there is a problem, it is more likely the VRAM size. If your runtime is configured to offload everything to the GPU, you would need that much VRAM, not RAM, which would be quite difficult. Check that the loader is set to run on the CPU so the weights stay in system RAM.
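For example, with llama-cpp-python (one common way to run GGUF files; the model path and parameter values here are just illustrative assumptions), forcing a pure CPU load looks something like this:

```python
from llama_cpp import Llama

# n_gpu_layers=0 keeps every layer in system RAM (pure CPU inference).
# By default the file is memory-mapped rather than copied, which also
# reduces peak memory during loading.
llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q3_K_M.gguf",  # hypothetical path
    n_gpu_layers=0,   # do not offload any layers to the GPU
    n_ctx=4096,       # context size; larger values need more RAM for the KV cache
    n_threads=16,     # adjust to your CPU core count
)

out = llm("What is the capital of France?", max_tokens=32)
print(out["choices"][0]["text"])
```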
There may be several reasons:
- Activation memory: when running inference, especially for large models, you need additional memory beyond the weights to store activations and the KV cache. This can be a significant portion of your memory usage, especially with a large context or when running a batch of multiple inputs at once.
- Loading overhead: the loading process itself may temporarily require more memory than the model weights alone, depending on how the framework reads and processes the model data (for example, whether it memory-maps the file or copies it).
- System configuration / process limits: ensure that the memory allocation limits of the operating system or the model-loading framework are not being hit. Some systems cap the memory a single process may use, so loading can fail even when the machine has plenty of free RAM; the sketch after this list shows one way to check.
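On Linux, one quick way to inspect the per-process limits is Python’s standard `resource` module (Unix-only; a minimal sketch):

```python
import resource

def show_limit(name, limit):
    # getrlimit returns (soft, hard) limits in bytes for these resources.
    soft, hard = resource.getrlimit(limit)
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else f"{v / 1e9:.1f} GB"
    print(f"{name}: soft={fmt(soft)}, hard={fmt(hard)}")

# RLIMIT_AS caps the total virtual address space; RLIMIT_DATA caps the data segment.
# If either soft limit is well below the model size, loading can fail even
# though the machine has plenty of physical RAM.
show_limit("RLIMIT_AS", resource.RLIMIT_AS)
show_limit("RLIMIT_DATA", resource.RLIMIT_DATA)
```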