Issues with Pruning and Quantization of Hugging Face LLMs on CPU

aiengineeringq · October 23, 2025, 9:13am

Hi everyone,

I’m experimenting with pruning and quantizing LLMs (like unclothed/gemma-2-2b-it) and running into several issues. I’m hoping to get advice or best practices. Here’s what I observed:

Model size increases after pruning

After structured pruning (~30%), the model size doubled instead of decreasing.
I suspect this is due to PyTorch mask tensors added during pruning.

Accuracy and inference time unchanged

After pruning, accuracy and response time remain almost identical.
Only a portion of weights were pruned; CPU inference doesn’t get faster automatically.

4-bit quantization on CPU

Attempting 4-bit quantization fails on CPU.
bitsandbytes library is GPU-optimized, so CPU-only systems aren’t supported.

INT8 quantization issues

INT8 quantization sometimes crashes when saving with save_pretrained().
Seems Transformers’ serialization does not fully support int8 tensors on CPU.

Package / environment issues

Missing bitsandbytes → 4-bit quantization fails.
Missing sentencepiece → tokenizer fails.
Missing langchain.text_splitter → ingestion fails.

Saving pruned + quantized model

Pruned + quantized model sometimes fails to save or size doubles.

GPU vs CPU differences

On CPU, cannot benefit from 4-bit quantization or speed-up.
GPU-only optimized kernels are needed for memory and inference improvements.

Questions:

Is there a recommended way to prune and quantize models on CPU without increasing size?
How do people typically handle saving pruned + quantized models?
Any tips to get speed/memory benefits on CPU?
Are there alternative approaches for CPU-only systems to reduce memory while maintaining accuracy?

Thanks in advance for any guidance!

John6666 · October 23, 2025, 2:17pm

While it is possible to avoid the pruning problem itself, the performance degradation caused by pruning is significantly greater compared to quantization. Recovering performance requires extensive fine-tuning using a GPU after pruning, making it a very difficult path. It is best avoided unless for research or specialized purposes.

When aiming for memory savings and speed on CPUs, simply using a smaller LLM and quantization methods & backends optimized for CPUs yields faster speeds and higher accuracy.

aiengineeringq · October 24, 2025, 7:22am

Hi community,
I am pruning the Gemma 2B model using PyTorch’s nn.utils.prune. After pruning ~20–30% of the weights, I notice some performance degradation.

What are the best practices to minimize accuracy loss during pruning?
Is structured pruning recommended over unstructured pruning for LLMs like Gemma?
After pruning, is LoRA fine-tuning sufficient to recover performance?

Title: Pruning + Fine-Tuning Workflow for Gemma 2B
Body:
Hello,
I want to prune Gemma 2B and then fine-tune it on a small dataset.

Should I prune first and then fine-tune with full parameters, or is LoRA/PEFT tuning enough?
Any advice on learning rate, batch size, or number of epochs for fine-tuning after pruning?
Has anyone successfully pruned + fine-tuned Gemma 2B? Any tips?

Quantization Questions

Title: CPU Inference for Gemma 2B After Pruning + Fine-Tuning
Body:
Hi,
I plan to perform pruning + LoRA fine-tuning + quantization for CPU inference on Gemma 2B.

Which quantization method works best for CPU: 8-bit or 4-bit?
Can I combine pruning + LoRA fine-tuning + 4-bit quantization without losing significant accuracy?
Does BitsAndBytes fully support CPU-only quantization for a 2B parameter model?

Hardware / Workflow Questions

Title: Minimum GPU Requirements for Gemma 2B Workflow
Body:
Hello Hugging Face community,
I am planning a workflow for Gemma 2B:

Download the model
Prune ~20–30% weights
Fine-tune with LoRA
Quantize for CPU inference

What is the minimum GPU VRAM required for this workflow?
Can pruning be done entirely on CPU, or is GPU strongly recommended?
Are there any example scripts for pruning + LoRA fine-tuning + quantization for 2B+ LLMs?

Tips before posting:

Include your environment (PyTorch version, GPU/CPU, RAM)
Include code snippet if possible, e.g., how you’re pruning or fine-tuning

Topic		Replies	Views
Optimum Pruning and Quantization Current Limitation 🤗Transformers	4	1006	April 26, 2022
Why are embedding / pooler layers excluded from pruning comparisons? Research	7	806	February 16, 2021
Pegasus Model Weights Compression/Pruning Models	14	4293	February 15, 2023
How to Prune Transformer based Model? 🤗Optimum	2	5771	August 25, 2023
Best open-source model for parsing messy PDFs on 16GB RAM (CPU only) Models	25	873	October 24, 2025