Loading Llama 2 with quantization on M1 MacBooks


I’m following this tutorial, Llama 2: AI Developers Handbook | Pinecone, which explains how to load Llama 2 with quantization. I’ve already worked through a couple of earlier error messages on my own, but now I’m stuck with the following response from the model:

“FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.”

In addition, I’m getting a TypeError: “BFloat16 is not supported on MPS”.

I tried doing so with `model.to('cpu')`; however, that resulted in an error, since this operation apparently isn’t supported for quantized models.
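In the meantime, the workaround I’m experimenting with is to skip bitsandbytes quantization entirely and pick the device and dtype by hand (just a sketch of my current thinking; I’m not sure it’s the right approach):

```python
import torch

# bitsandbytes FP4/NF4 quantization requires a CUDA GPU, and the MPS
# backend does not support bfloat16, so on an M1 the fallback would be
# float16 on "mps" (or float32 on "cpu").
if torch.cuda.is_available():
    device, dtype = "cuda", torch.bfloat16
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float16  # bfloat16 unsupported on MPS
else:
    device, dtype = "cpu", torch.float32

# The model would then be loaded unquantized, e.g.:
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
# model.to(device)
```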

I don’t fully understand what the message means. My guess is that it’s related to my hardware (I’m using an M1 notebook).

Could someone with a better understanding of the underlying architecture help me debug this?

As a more general question: is it even possible to run a bitsandbytes-quantized model on an M1 MacBook, or does that require a CUDA GPU?

Thanks in advance