The load_in_8bit flag enables 8-bit quantization with LLM.int8(). LLM.int8() is a lightweight wrapper around custom CUDA functions, so quantization is only possible on a GPU.
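A minimal sketch of how the flag is typically passed to transformers' from_pretrained, assuming a CUDA GPU is available and bitsandbytes plus accelerate are installed; the model name "facebook/opt-350m" is just an illustrative choice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # example model, swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_8bit=True quantizes the linear layers with LLM.int8() via bitsandbytes;
# device_map="auto" places the quantized weights on the GPU, which is required.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```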
You can find the required details on the official bitsandbytes GitHub page.
Requirements: Python >=3.8, a Linux distribution (e.g. Ubuntu), and CUDA > 10.0.