hello, is it possible to run inference of quantized 8 bit or 4 bit models on cpu?
I don't believe so, since the bitsandbytes library is just a wrapper around CUDA functions, which are GPU-only.
For those still searching, I found some sources:

- Optimum Intel for quantization on Intel CPUs (🤗 Optimum Intel)
- Core ML has quantization tools for Apple CPUs (Compressing Neural Network Weights)
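To make the idea concrete, here is a minimal pure-Python sketch of absmax 8-bit quantization, the basic scheme these tools build on: scale each weight by the absolute maximum so it fits in the int8 range, then multiply back by the scale at inference time. This is only an illustration of the concept on CPU, not the actual bitsandbytes, Optimum Intel, or Core ML implementation.

```python
def quantize_absmax(weights):
    """Map floats to int8 range [-127, 127] using the absolute-max scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]

# Hypothetical example weights, chosen just for demonstration.
weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
```

Storing `q` (1 byte per weight) plus one `scale` per tensor is what gives the roughly 4x memory saving over float32; the real libraries apply this per block or per channel and fuse the dequantization into the matmul.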