Error running Llama 3.1 Minitron 4B quantized model with Ollama

Hi,
I’m getting an error while trying to run a quantized version of the Nvidia Llama 3.1 Minitron model using Ollama. I’d appreciate any help that I can get.

Model Details:

  • Model: Llama-3.1-Minitron-4B-Width-Base (GGUF)
  • Quantization: Q4_K

Steps I’ve taken:

  1. Downloaded the model file: Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf
  2. Created a modelfile with the following contents:
FROM ./Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf
SYSTEM """
You are Swedish Chef from the classic Muppet series. You answer every question
"""
  3. Created the model in Ollama:
ollama create swede -f ./modelfile

This step completed successfully (a quick way to verify what was registered is sketched after step 4).

  4. Attempted to run the model:
ollama run swede
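
In case it is useful, a quick sanity check on what the create step actually registered (the stored Modelfile and the detected family/quantization) can be done against the local Ollama API on the default port 11434. The Python sketch below uses only the standard library; note that the request field ("name" vs "model") and the exact response keys may vary between Ollama versions, so treat it as an example rather than a definitive recipe.

import json
import urllib.request

# Ask the local Ollama server what `ollama create swede` registered.
# /api/show does not start the model runner, so it should work even
# when `ollama run` later fails.
req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"name": "swede"}).encode(),  # newer servers may expect "model" instead
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    info = json.loads(resp.read())

# "details" typically carries family, parameter size and quantization level;
# "modelfile" echoes back the Modelfile that was stored at create time.
print(info.get("details", {}))
print(info.get("modelfile", "")[:500])

If the reported family or quantization looks off here, that can be a hint the GGUF metadata itself is unusual, which ties into the shape errors below.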

Initial error encountered:

Error: llama runner process has terminated: signal: aborted (core dumped)
error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 3072, 3072, got 3072, 4096, 1, 1
llama_load_model_from_file: exception loading model
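
This error reads as the loader expecting attn_q.weight to be 3072 x 3072 (embedding_length x embedding_length) while the file stores 3072 x 4096. One way to confirm what the GGUF actually contains is to dump the first block's attention tensor shapes with the gguf Python package (an assumption: pip install gguf, the reader that ships with llama.cpp's gguf-py). This is only an inspection sketch; also note that GGUF/ggml may list dimensions in the reverse of the usual PyTorch order.

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf")

# Print the shapes of the first transformer block's attention tensors so they
# can be compared with what the loader said it expected (3072, 3072).
for tensor in reader.tensors:
    if tensor.name.startswith("blk.0.attn_"):
        print(tensor.name, tensor.shape.tolist())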

After updating Ollama, new error encountered:

Error: llama runner process has terminated: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
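
This assert appears to come from ggml's RoPE code, and my unconfirmed guess is that it is another symptom of the same thing the first error pointed at: the width-pruned Minitron seems to keep a head layout where head_count x head_dim (the 4096 in the error above) is larger than embedding_length (3072), which may need newer llama.cpp support. Here is a hedged sketch for reading the relevant header keys, using the same gguf package as above; the raw ReaderField layout used below can differ between gguf versions.

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf")

def scalar(field):
    # A ReaderField keeps raw buffers in `parts` and indices of the value
    # buffers in `data`; for simple numeric fields the value is the single
    # element of that buffer. (Exact layout can vary across gguf versions.)
    return int(field.parts[field.data[0]][0])

for key in [
    "llama.embedding_length",
    "llama.attention.head_count",
    "llama.attention.head_count_kv",
    "llama.attention.key_length",      # per-head dimension, if present
    "llama.rope.dimension_count",
]:
    field = reader.fields.get(key)
    print(key, "=", scalar(field) if field is not None else "<not present>")

# If both keys are present, compare head_count * key_length with the
# embedding length; a mismatch would match the non-square attn_q shape above.
hc = reader.fields.get("llama.attention.head_count")
kl = reader.fields.get("llama.attention.key_length")
emb = reader.fields.get("llama.embedding_length")
if hc and kl and emb:
    print("head_count * key_length =", scalar(hc) * scalar(kl),
          "vs embedding_length =", scalar(emb))

If this prints something like head_count x key_length = 4096 against embedding_length = 3072, that would line up with the shapes in the first error.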

System Information (using CPU):

  • Ollama version: 0.3.6
  • OS: Ubuntu 22.04.4 LTS x86_64 (server)
  • CPU: QEMU Virtual version 2.1.2 (8)
  • GPU: 00:02.0 Cirrus Logic GD 5446
  • Memory: 302MiB / 11956MiB

Additional Information:

  • The model creation process with Ollama seemed to succeed initially.
  • The Hugging Face page suggests using llama.cpp for this model, but I’m trying to use it with Ollama.
  • Other quantization levels are available (Q8_0, Q6_K, Q3_K, Q2_K), but I haven’t tried them yet.
  • After updating Ollama, the error message changed, but the model still fails to run.

Questions:

  1. Is this GGUF format and quantization level (Q4_K) supported by Ollama 0.3.6?
  2. Could the new error (GGML_ASSERT failed) be related to the model’s compatibility with Ollama?
  3. Do you recommend trying a different quantization level, like Q8_0?
  4. Are there any specific steps I should take to make this model compatible with Ollama 0.3.6?
  5. Could this be related to the model’s architecture or the way it was quantized?

I’d greatly appreciate any help resolving this issue, or suggestions for alternative approaches. Thanks in advance!

I am getting the same error. I tried IQ3_M from:

and IQ3_M and IQ3_S from:

I have a similar problem with all of the Width-based versions of this model. I managed to get the Q8 Depth version to work with Ollama, but there appears to be something wrong with my setup, because the model just hallucinates.

It appears there is work underway to address this problem with changes to llama.cpp, but there may be some delay before those changes are rolled into Ollama.