Loading Llama 2 with quantization on M1 MacBooks

Hi,

I’m following this tutorial, Llama 2: AI Developers Handbook | Pinecone, which explains how to load Llama-2 with quantization. I’ve already worked through and solved a couple of earlier error messages, but now I’m stuck with the following error:

“FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.”

In addition, I’m getting a TypeError: “BFloat16 is not supported on MPS”

I tried doing that with `model.to('cpu')`, but it resulted in an error, since this operation apparently doesn’t work with quantized models.

I don’t fully understand what the message means. My guess is that it’s related to my hardware (I’m using an M1 notebook).
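For reference, the loading code I’m running is essentially the 4-bit bitsandbytes setup from the tutorial; paraphrased from memory here, so the model id and exact arguments may differ slightly:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit bitsandbytes setup, along the lines of the tutorial.
# bitsandbytes' 4-bit layers (LinearFP4 / Linear4bit) only initialize their
# quantization state on a CUDA device, which seems to be what triggers the
# "call .cuda() or .to(device)" message on an M1, where there is no CUDA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the tutorial's setting; may be fp4 in my case
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is also what MPS rejects
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```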

Could someone with a better understanding of the underlying architecture help me debug this?

As a more general question: should it normally be possible to quantize the model on M1 MacBooks?
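In case it’s useful context: the fallback I’ve been considering is skipping bitsandbytes entirely and loading the model in plain float16 on the MPS backend, roughly like the sketch below (model id is a placeholder, and a 7B model in fp16 needs a lot of unified memory, so I’d much rather get the quantized setup working):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)

# No quantization_config at all, since bitsandbytes 4-bit/8-bit is CUDA-only.
# float16 instead of bfloat16, because MPS rejects bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
)
model.to("mps")  # plain fp16 tensors, so moving devices works here
```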

Thanks in advance


It is frustrating to fine-tune Llama 2 on Apple silicon.
Why do we need conversions? Why can’t we just quantize and run it the way we do on Intel?
Is anyone working on that, or do we just replace Apple silicon with Intel?

Salam Iraj,
Have you found any solution?