If you want to run inference of quantized LLMs on CPU, I'd recommend taking a look at the llama.cpp project: GitHub - ggerganov/llama.cpp: LLM inference in C/C++. It leverages a format called GGUF.
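For example, here's a minimal sketch of running a GGUF model on CPU via the llama-cpp-python bindings (`pip install llama-cpp-python`); the model filename and prompt are just placeholders for whatever GGUF file you have locally:

```python
# Minimal CPU inference sketch with llama-cpp-python.
# "model.q4_K_M.gguf" is a placeholder for any quantized GGUF file.
from llama_cpp import Llama

llm = Llama(model_path="model.q4_K_M.gguf", n_ctx=2048)  # runs on CPU by default

output = llm("Q: What is the GGUF format used for? A:", max_tokens=64)
print(output["choices"][0]["text"])
```

You can also skip the Python bindings entirely and use the CLI binaries that ship with llama.cpp, but the bindings are convenient if the rest of your pipeline is in Python.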
There's now also the MLX framework by Apple, which lets you run these models on MacBooks: GitHub - ml-explore/mlx: MLX: An array framework for Apple silicon.
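If you go the MLX route, the companion mlx-lm package (`pip install mlx-lm`) gives you a high-level API; a rough sketch, where the model id is just an example of an MLX-converted repo from the Hub:

```python
# Sketch of text generation on Apple silicon with mlx-lm.
# The model id below is an example; any MLX-converted model from the Hub should work.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
text = generate(model, tokenizer, prompt="What is MLX?", verbose=True)
print(text)
```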
What you could do is train a model using the Hugging Face tooling (PEFT, TRL, Transformers) and then export it to the GGUF format with the conversion script: llama.cpp/convert-hf-to-gguf.py at master · ggerganov/llama.cpp · GitHub. You can then run your quantized model on CPU, as sketched below.
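Here's a rough sketch of that workflow, assuming you fine-tuned a LoRA adapter with PEFT/TRL (the model name and adapter path are placeholders): merge the adapter back into the base model, save it in Hugging Face format, then point the conversion script at that folder.

```python
# Sketch: merge a PEFT/LoRA adapter into its base model and save it
# so llama.cpp's convert-hf-to-gguf.py can convert it to GGUF.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base model
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")          # placeholder adapter path
model = model.merge_and_unload()  # fold the LoRA weights into the base model

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model.save_pretrained("merged-model")
tokenizer.save_pretrained("merged-model")

# Then, from the llama.cpp repo, something along the lines of:
#   python convert-hf-to-gguf.py merged-model --outfile model.gguf
# should give you a GGUF file you can quantize further and run on CPU.
```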