Optimum arm64 quantized models on Apple Silicon (M1)

Hi everyone,

I’m running some quantization experiments on my MBP 13 M1 with Optimum and ONNX. I see that the arm64-quantized model needs about 2x less time per inference pass than the basic ORT model without quantization. At the same time, if I run the avx2-quantized model, it also takes less time per inference pass, but it is not as fast as the arm64 model (see screenshot). Does Optimum actually use arm64 instructions during inference of arm64-quantized models? Or is the speedup just a result of some other default optimizations?
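
For reference, here is roughly how the two quantized models can be produced with Optimum (a minimal sketch, assuming a recent Optimum version; the checkpoint name and the `export=True` flag are placeholders, and the exact API may differ slightly between versions):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint

# Export the Transformers model to ONNX and build a quantizer from it
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
quantizer = ORTQuantizer.from_pretrained(model)

# Dynamic quantization configs targeting arm64 and avx2 kernels
qconfig_arm64 = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
qconfig_avx2 = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)

# Produce the two quantized models, then benchmark each with ORTModel inference
quantizer.quantize(save_dir="onnx-quantized-arm64", quantization_config=qconfig_arm64)
quantizer.quantize(save_dir="onnx-quantized-avx2", quantization_config=qconfig_avx2)
```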

Hi @Rexhaif, sorry for the late response :frowning:

The only difference between the arm64 quantization config and the avx2 one is that weights are mapped to signed 8-bit integers for arm64 and to unsigned 8-bit integers for avx2. So I guess this simply fits the M1 better.
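
If it helps, you can check this directly by inspecting the two configs (a quick sketch; the `weights_dtype` / `activations_dtype` fields are from recent Optimum `QuantizationConfig` versions and may be named differently in older releases):

```python
from optimum.onnxruntime.configuration import AutoQuantizationConfig

arm64_cfg = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
avx2_cfg = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)

# arm64 maps weights to signed int8 (QInt8), avx2 to unsigned int8 (QUInt8);
# activations are unsigned int8 in both cases.
print("arm64:", arm64_cfg.weights_dtype, arm64_cfg.activations_dtype)
print("avx2: ", avx2_cfg.weights_dtype, avx2_cfg.activations_dtype)
```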