Loading Llama 2 with quantization on M1 MacBooks


I’m following this tutorial, Llama 2: AI Developers Handbook | Pinecone, which explains how to load Llama 2 with quantization. I’ve already worked through a couple of earlier error messages on my own, but now I’m stuck with the following response from the model:

“FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.”

In addition, I’m getting a TypeError: “BFloat16 is not supported on MPS”.

I tried doing so with `model.to('cpu')`; however, that resulted in an error, since this operation apparently isn’t supported for quantized models.
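In the meantime, the workaround I’m experimenting with is to skip bitsandbytes quantization entirely and pick the device and dtype by hand (just a sketch of my current thinking; I’m not sure it’s the right approach):

```python
import torch

# bitsandbytes FP4/NF4 quantization requires a CUDA GPU, and the MPS
# backend does not support bfloat16, so on an M1 the fallback would be
# float16 on "mps" (or float32 on "cpu").
if torch.cuda.is_available():
    device, dtype = "cuda", torch.bfloat16
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float16  # bfloat16 unsupported on MPS
else:
    device, dtype = "cpu", torch.float32

# The model would then be loaded unquantized, e.g.:
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
# model.to(device)
```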

I don’t fully understand what the message means. My guess is that it’s related to my hardware (I’m using an M1 notebook).

Could someone with a better understanding of the underlying architecture help me debug this?

As a more general question: is it even possible to run a bitsandbytes-quantized model on an M1 MacBook, or does that require a CUDA GPU?

Thanks in advance