Optimum arm64 quantized models on Apple Silicon (M1)

Hi everyone,

I’m running some quantization experiments on my MBP 13 M1 with Optimum and ONNX. I see that the arm64-quantized model needs about 2x less time per inference pass than the basic ORT model without quantization. At the same time, if I run the avx2-quantized model, it also takes less time per inference pass, but it is not as fast as the arm64 model (see screenshot). Does Optimum actually use arm64 instructions during inference of arm64-quantized models? Or is the speedup just a result of some other default optimizations?
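
For reference, here is roughly how the two quantized models can be produced with Optimum (a minimal sketch, assuming a recent Optimum version; the checkpoint name and the `export=True` flag are placeholders, and the exact API may differ slightly between versions):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint

# Export the Transformers model to ONNX and build a quantizer from it
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
quantizer = ORTQuantizer.from_pretrained(model)

# Dynamic quantization configs targeting arm64 and avx2 kernels
qconfig_arm64 = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
qconfig_avx2 = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)

# Produce the two quantized models, then benchmark each with ORTModel inference
quantizer.quantize(save_dir="onnx-quantized-arm64", quantization_config=qconfig_arm64)
quantizer.quantize(save_dir="onnx-quantized-avx2", quantization_config=qconfig_avx2)
```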

Hi @Rexhaif, sorry for the late response :frowning:

The only difference between the arm64 quantization config and the avx2 one is that weights are mapped to signed 8-bit integers for arm64 and to unsigned 8-bit integers for avx2. So I guess this simply fits the M1 better.
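
If it helps, you can check this directly by inspecting the two configs (a quick sketch; the `weights_dtype` / `activations_dtype` fields are from recent Optimum `QuantizationConfig` versions and may be named differently in older releases):

```python
from optimum.onnxruntime.configuration import AutoQuantizationConfig

arm64_cfg = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
avx2_cfg = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)

# arm64 maps weights to signed int8 (QInt8), avx2 to unsigned int8 (QUInt8);
# activations are unsigned int8 in both cases.
print("arm64:", arm64_cfg.weights_dtype, arm64_cfg.activations_dtype)
print("avx2: ", avx2_cfg.weights_dtype, avx2_cfg.activations_dtype)
```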