Unable to run GGUF model

I have a server with 250 GB of RAM (64 cores) but no GPU. I want to run a quantized Llama 3.3 70B model and have tried Llama-3.3-70B-Instruct-Q5_K_M.gguf and Llama-3.3-70B-Instruct-Q3_K_M.gguf, but both fail during the model load phase due to RAM limitations.

The code fails during the model loading phase with the message "Converting and de-quantizing GGUF tensor…". While this runs, I checked RAM utilization with the htop command: RAM fills up completely and the program gets killed.

I want to know what "Converting and de-quantizing GGUF tensor…" means, and whether there is any error in my code.

Below is my code:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# model initialization

model = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M'
path = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf'

tokenizer = AutoTokenizer.from_pretrained(model, gguf_file=path)
model = AutoModelForCausalLM.from_pretrained(model, gguf_file=path)

text2text = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Get the answer

result = text2text("translate English to French: New Delhi is India's capital")

print(f"Answer: {result['answer']}")


I think the current Transformers cannot execute GGUF in its quantized state, so it de-quantizes the weights first; that is what the "Converting and de-quantizing GGUF tensor…" message refers to. If you don't specify a data type, the weights are expanded to 32-bit floating point, so from Q4 the size becomes roughly 8 times larger…
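If you do want to stay with Transformers, you can at least limit the blow-up by requesting 16-bit weights at load time. This is only a minimal sketch reusing the paths from your code; keep in mind that a 70B model still needs roughly 140 GB for the float16 weights alone, plus overhead during conversion, so it may still be tight on a 250 GB machine.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Same local paths as in the question
model_dir = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M'
gguf_path = '/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf'

tokenizer = AutoTokenizer.from_pretrained(model_dir, gguf_file=gguf_path)

# torch_dtype=torch.float16 keeps the de-quantized weights in 16-bit
# instead of the 32-bit default, roughly halving the required RAM
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    gguf_file=gguf_path,
    torch_dtype=torch.float16,
)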
If you are running GGUF on CPU, I think llama.cpp or Ollama is the easier route: they execute the model directly in its quantized form, so they are memory-saving and fast.
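For example, with the llama-cpp-python bindings (pip install llama-cpp-python), something like the following sketch should work. It uses the GGUF path from your question; n_ctx and n_threads are illustrative values you would tune yourself.

from llama_cpp import Llama

# Loads the GGUF directly in its quantized form, so RAM usage stays
# close to the file size plus context buffers (no de-quantization step)
llm = Llama(
    model_path='/vivo/genai_poc/hf_model/Llama-3.3-70B-Instruct-Q3_K_M/Llama-3.3-70B-Instruct-Q3_K_M.gguf',
    n_ctx=4096,      # context window
    n_threads=64,    # match the server's 64 cores
)

result = llm("Translate English to French: New Delhi is India's capital.", max_tokens=64)
print(result["choices"][0]["text"])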
If you need to train the model, you will need Transformers…