NotImplementedError: ggml_type 21 not implemented

I’m trying to run a 3-bit quantized GGUF of Llama 3.1 70B:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bartowski/Meta-Llama-3.1-70B-Instruct-GGUF"
filename = "Meta-Llama-3.1-70B-Instruct-IQ3_XS.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

and I’m getting the following error:

  File "***/lib/python3.11/site-packages/transformers/integrations/ggml.py", line 510, in load_dequant_gguf_tensor
    raise NotImplementedError(
NotImplementedError: ggml_type 21 not implemented

I’m using

  • transformers 4.44.2
  • gguf 0.10.0

Am I doing something wrong? Thanks.

There are multiple issues here, but to put it simply: the GGUF files that transformers can load and the GGUF files that llama.cpp can load are currently, in effect, two different things.
Think of them as two dialects of the same format.
GGUF started out as llama.cpp’s own format, and transformers only implements dequantization for a subset of its quantization types. Your IQ3_XS file contains tensors stored as ggml_type 21 (one of the newer i-quant types, IQ3_S), which the transformers dequantizer in 4.44 does not handle, hence the error.
The file you are trying to use targets llama.cpp; you can usually tell from who published the quantization.
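As a rough illustration of what the error message refers to: the ggml_type number is just an index into llama.cpp’s quantization-type enum. The ID-to-name mapping below is a sketch based on my reading of the `GGML_TYPE_*` enum in ggml.h, and the exact values may drift between versions, so treat it as an assumption rather than a spec:

```python
# Partial map of ggml_type IDs to quantization names.
# NOTE: values assumed from llama.cpp's GGML_TYPE_* enum (ggml.h)
# at the time of writing; gaps (4, 5, 9, 15) are removed or
# intermediate types. Verify against your ggml/gguf version.
GGML_TYPE_NAMES = {
    0: "F32",
    1: "F16",
    2: "Q4_0",
    3: "Q4_1",
    6: "Q5_0",
    7: "Q5_1",
    8: "Q8_0",
    10: "Q2_K",
    11: "Q3_K",
    12: "Q4_K",
    13: "Q5_K",
    14: "Q6_K",
    16: "IQ2_XXS",
    17: "IQ2_XS",
    18: "IQ3_XXS",
    19: "IQ1_S",
    20: "IQ4_NL",
    21: "IQ3_S",   # the type your IQ3_XS file trips on
    22: "IQ2_S",
    23: "IQ4_XS",
}

print(GGML_TYPE_NAMES[21])  # IQ3_S
```

If you have the gguf Python package installed, `gguf.constants.GGMLQuantizationType(21).name` should give you the same answer directly from the library’s own enum.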

And since it is unlikely that anyone publishes GGUF files specifically for transformers, you can either use that GGUF with llama.cpp or find a non-GGUF quantization (such as GPTQ or bitsandbytes NF4) for transformers.

Understood, thanks!