Error running Llama 3.1 Minitron 4B quantized model with Ollama

Hi,
I’m getting an error while trying to run a quantized version of the Nvidia Llama 3.1 Minitron model using Ollama. I’d appreciate any help that I can get.

Model Details:

  • Model: Llama-3.1-Minitron-4B-Width-Base (GGUF)
  • Quantization: Q4_K

Steps I’ve taken:

  1. Downloaded the model file: Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf
  2. Created a modelfile with the following contents:
FROM ./Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf
SYSTEM """
You are Swedish Chef from the classic Muppet series. You answer every question
"""
  3. Created the model in Ollama:
ollama create swede -f ./modelfile

This step completed successfully (a quick way to verify what was registered is sketched after step 4).

  4. Attempted to run the model:
ollama run swede
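
In case it is useful, a quick sanity check on what the create step actually registered (the stored Modelfile and the detected family/quantization) can be done against the local Ollama API on the default port 11434. The Python sketch below uses only the standard library; note that the request field ("name" vs "model") and the exact response keys may vary between Ollama versions, so treat it as an example rather than a definitive recipe.

import json
import urllib.request

# Ask the local Ollama server what `ollama create swede` registered.
# /api/show does not start the model runner, so it should work even
# when `ollama run` later fails.
req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"name": "swede"}).encode(),  # newer servers may expect "model" instead
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    info = json.loads(resp.read())

# "details" typically carries family, parameter size and quantization level;
# "modelfile" echoes back the Modelfile that was stored at create time.
print(info.get("details", {}))
print(info.get("modelfile", "")[:500])

If the reported family or quantization looks off here, that can be a hint the GGUF metadata itself is unusual, which ties into the shape errors below.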

Initial error encountered:

Error: llama runner process has terminated: signal: aborted (core dumped)
error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 3072, 3072, got 3072, 4096, 1, 1
llama_load_model_from_file: exception loading model
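
This error reads as the loader expecting attn_q.weight to be 3072 x 3072 (embedding_length x embedding_length) while the file stores 3072 x 4096. One way to confirm what the GGUF actually contains is to dump the first block's attention tensor shapes with the gguf Python package (an assumption: pip install gguf, the reader that ships with llama.cpp's gguf-py). This is only an inspection sketch; also note that GGUF/ggml may list dimensions in the reverse of the usual PyTorch order.

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf")

# Print the shapes of the first transformer block's attention tensors so they
# can be compared with what the loader said it expected (3072, 3072).
for tensor in reader.tensors:
    if tensor.name.startswith("blk.0.attn_"):
        print(tensor.name, tensor.shape.tolist())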

After updating Ollama, new error encountered:

Error: llama runner process has terminated: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
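
This assert appears to come from ggml's RoPE code, and my unconfirmed guess is that it is another symptom of the same thing the first error pointed at: the width-pruned Minitron seems to keep a head layout where head_count x head_dim (the 4096 in the error above) is larger than embedding_length (3072), which may need newer llama.cpp support. Here is a hedged sketch for reading the relevant header keys, using the same gguf package as above; the raw ReaderField layout used below can differ between gguf versions.

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Llama-3.1-Minitron-4B-Width-Base.Q4_K.gguf")

def scalar(field):
    # A ReaderField keeps raw buffers in `parts` and indices of the value
    # buffers in `data`; for simple numeric fields the value is the single
    # element of that buffer. (Exact layout can vary across gguf versions.)
    return int(field.parts[field.data[0]][0])

for key in [
    "llama.embedding_length",
    "llama.attention.head_count",
    "llama.attention.head_count_kv",
    "llama.attention.key_length",      # per-head dimension, if present
    "llama.rope.dimension_count",
]:
    field = reader.fields.get(key)
    print(key, "=", scalar(field) if field is not None else "<not present>")

# If both keys are present, compare head_count * key_length with the
# embedding length; a mismatch would match the non-square attn_q shape above.
hc = reader.fields.get("llama.attention.head_count")
kl = reader.fields.get("llama.attention.key_length")
emb = reader.fields.get("llama.embedding_length")
if hc and kl and emb:
    print("head_count * key_length =", scalar(hc) * scalar(kl),
          "vs embedding_length =", scalar(emb))

If this prints something like head_count x key_length = 4096 against embedding_length = 3072, that would line up with the shapes in the first error.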

System Information (using CPU):

  • Ollama version: 0.3.6
  • OS: Ubuntu 22.04.4 LTS x86_64 (server)
  • CPU: QEMU Virtual version 2.1.2 (8)
  • GPU: 00:02.0 Cirrus Logic GD 5446
  • Memory: 302MiB / 11956MiB

Additional Information:

  • The model creation process with Ollama seemed to succeed initially.
  • The Hugging Face page suggests using llama.cpp for this model, but I’m trying to use it with Ollama.
  • Other quantization levels are available (Q8_0, Q6_K, Q3_K, Q2_K), but I haven’t tried them yet.
  • After updating Ollama, the error message changed, but the model still fails to run.

Questions:

  1. Is this GGUF format and quantization level (Q4_K) supported by Ollama 0.3.6?
  2. Could the new error (GGML_ASSERT failed) be related to the model’s compatibility with Ollama?
  3. Do you recommend trying a different quantization level, like Q8_0?
  4. Are there any specific steps I should take to make this model compatible with Ollama 0.3.6?
  5. Could this be related to the model’s architecture or the way it was quantized?

I’d greatly appreciate any help resolving this issue, or suggestions for alternative approaches. Thanks in advance!

I am getting the same error. I tried IQ3_M from:

and IQ3_M and IQ3_S from:

I have a similar problem with all of the Width-based versions of this model. I managed to get the Q8 Depth version to work with Ollama, but there appears to be something wrong with my setup, because the model just hallucinates.

It appears there is work underway to address this problem with changes to llama.cpp, but there may be some delay before those changes are rolled into Ollama.