Unable to run GGUF model

I have a server with 250 GB of RAM (64 cores) but no GPU. I want to run a quantized Llama 3.3 70B model and have tried Llama-3.3-70B-Instruct-Q5_K_M.gguf and Llama-3.3-70B-Instruct-Q3_K_M.gguf, but both fail during the model load phase due to RAM limitations.

The code fails during the model loading phase with the message "Converting and de-quantizing GGUF tensor…". While this runs, I checked RAM utilization with the htop command: RAM fills up completely and the program gets killed.

I want to know what "Converting and de-quantizing GGUF tensor…" means, and whether there is any error in my code.

Below is my code:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# model initialization

model = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M'
path = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf'

tokenizer = AutoTokenizer.from_pretrained(model, gguf_file=path)
model = AutoModelForCausalLM.from_pretrained(model, gguf_file=path)

text2text = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Get the answer

result = text2text("translate English to French: New Delhi is India's capital")

print(f"Answer: {result['answer']}")


I think the current Transformers cannot execute GGUF in its quantized state, so it de-quantizes the weights first; that is what the "Converting and de-quantizing GGUF tensor…" message refers to. If you don't specify a data type, the weights are expanded to 32-bit floating point, so from Q4 the size becomes roughly 8 times larger…
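If you do want to stay with Transformers, you can at least limit the blow-up by requesting 16-bit weights at load time. This is only a minimal sketch reusing the paths from your code; keep in mind that a 70B model still needs roughly 140 GB for the float16 weights alone, plus overhead during conversion, so it may still be tight on a 250 GB machine.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Same local paths as in the question
model_dir = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M'
gguf_path = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf'

tokenizer = AutoTokenizer.from_pretrained(model_dir, gguf_file=gguf_path)

# torch_dtype=torch.float16 keeps the de-quantized weights in 16-bit
# instead of the 32-bit default, roughly halving the required RAM
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    gguf_file=gguf_path,
    torch_dtype=torch.float16,
)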
If you are running GGUF on CPU, I think llama.cpp or Ollama is the easier route: they execute the model directly in its quantized form, so they are memory-saving and fast.
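For example, with the llama-cpp-python bindings (pip install llama-cpp-python), something like the following sketch should work. It uses the GGUF path from your question; n_ctx and n_threads are illustrative values you would tune yourself.

from llama_cpp import Llama

# Loads the GGUF directly in its quantized form, so RAM usage stays
# close to the file size plus context buffers (no de-quantization step)
llm = Llama(
    model_path='/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf',
    n_ctx=4096,      # context window
    n_threads=64,    # match the server's 64 cores
)

result = llm("Translate English to French: New Delhi is India's capital.", max_tokens=64)
print(result["choices"][0]["text"])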
If you need to train the model, you will need Transformers…