Hi,
I’m following the Pinecone tutorial “Llama 2: AI Developers Handbook”, which explains how to load Llama-2 with quantization. I’ve worked through a couple of error messages along the way, but now I’m stuck on the following message, which I get when I run the model:
“FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.”
In addition, I’m getting a TypeError: “BFloat16 is not supported on MPS”
Following the suggestion in the first message, I tried `model.to('cpu')`, but that raised another error, since `.to()` apparently isn’t supported for quantized models.
I don’t fully understand what the message means. My guess is that it’s related to my hardware (I’m using an M1 notebook).
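In case it helps with debugging, here is a minimal sketch of how the backends PyTorch can see on a machine could be checked. `pick_device` and `describe_backends` are hypothetical helper names, not part of the tutorial, and the assumption that the 4-bit quantization path expects a CUDA device is my guess rather than something I have confirmed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a PyTorch device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def describe_backends() -> dict:
    """Report which backends PyTorch can see (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU fallback can be reported.
        return {"torch_installed": False, "cuda": False, "mps": False,
                "device": "cpu"}

    cuda = torch.cuda.is_available()
    # torch.backends.mps only exists in newer PyTorch releases, hence getattr.
    mps_backend = getattr(torch.backends, "mps", None)
    mps = bool(mps_backend is not None and mps_backend.is_available())
    return {"torch_installed": True, "cuda": cuda, "mps": mps,
            "device": pick_device(cuda, mps)}


if __name__ == "__main__":
    info = describe_backends()
    print(info)
    if not info["cuda"]:
        # Assumption: the quantized loading path needs a CUDA GPU.
        print("No CUDA device found; 4-bit quantized loading may not work here.")
```

On an M1 notebook I would expect this to report no CUDA device and, with a recent PyTorch build, an available MPS backend, which would fit both error messages above.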
Could someone with a better understanding of the underlying architecture help me debug this?
As a more general question: should it normally be possible to quantize the model on M1 MacBooks?
Thanks in advance