My code fails during the model loading phase with the message "Converting and de-quantizing GGUF tensor…". While this is running I checked RAM utilization with the htop command: RAM fills up completely and the process gets killed.
I want to know what "Converting and de-quantizing GGUF tensor…" means, and whether there is an error in my code.
Below is my code:
from transformers import AutoTokenizer, AutoModelForCausalLM

# model initialization
model = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M'
path = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf'

tokenizer = AutoTokenizer.from_pretrained(model, gguf_file=path)
model = AutoModelForCausalLM.from_pretrained(model, gguf_file=path)
I think the current Transformers cannot execute GGUF in its quantized state, so Transformers first de-quantizes it. If you don't specify a data type, the weights are expanded to 32-bit floating point, so a Q4 model grows to roughly 8 times its file size…
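To make the scale concrete: a 70B-parameter model expanded to 32-bit floats needs roughly 70e9 × 4 bytes ≈ 280 GB of RAM, which is why the load gets killed. Below is a minimal sketch using your own paths; I am assuming here that torch_dtype passed to from_pretrained is applied to the de-quantized weights, which at least halves the footprint by materializing them in float16 instead of float32:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M'
gguf_path = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf'

tokenizer = AutoTokenizer.from_pretrained(model_dir, gguf_file=gguf_path)

# De-quantized weights are materialized in this dtype; float16 halves the
# footprint compared to the default float32.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    gguf_file=gguf_path,
    torch_dtype=torch.float16,
)

Even at float16 that is still on the order of 140 GB for a 70B model, so on most CPU-only machines it will not fit regardless.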
If you only need to run GGUF on the CPU, I think using llama.cpp or Ollama is easier, more memory-efficient, and faster, since they run the quantized weights directly instead of de-quantizing them.
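For example, a minimal sketch with the llama-cpp-python package (the model path is yours; n_ctx and n_threads are values I am assuming for illustration):

from llama_cpp import Llama

# Load the GGUF file directly; the Q3_K_M weights stay quantized,
# so peak RAM stays close to the file size instead of hundreds of GB.
llm = Llama(
    model_path='/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf',
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads
)

out = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': 'Hello!'}],
    max_tokens=128,
)
print(out['choices'][0]['message']['content'])

Here memory usage is roughly the GGUF file size (about 34 GB for a 70B Q3_K_M), rather than the fully de-quantized model.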
If you need to train the model, you will need Transformers…