Issues with Pruning and Quantization of Hugging Face LLMs on CPU

Hi everyone,

I’m experimenting with pruning and quantizing LLMs (like unsloth/gemma-2-2b-it) and running into several issues, so I’m hoping for advice or best practices. Here’s what I observed:


:one: Model size increases after pruning

  • After structured pruning (~30%), the model size doubled instead of decreasing.

  • I suspect this is because torch.nn.utils.prune reparametrizes each pruned layer, keeping both the original weight (weight_orig) and a weight_mask buffer (sketch below).
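
If I understand correctly, calling prune.remove() should fold the mask back into the weight so the extra tensors are dropped. A minimal sketch, assuming `model` is the causal LM already loaded on CPU:

```python
import torch
import torch.nn.utils.prune as prune

# Sketch: prune every Linear layer, then strip the reparametrization so only the
# final (zeroed) `weight` tensor is kept instead of `weight_orig` + `weight_mask`.
for _, module in model.named_modules():  # `model` = the loaded causal LM (assumed)
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")
```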


:two: Accuracy and inference time unchanged

  • After pruning, accuracy and response time remain almost identical.

  • Only a portion of the weights was pruned, and the pruned weights are still stored as dense zeros, so CPU inference doesn’t get faster automatically (quick check below).
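
A quick way to confirm this (rough sketch, again assuming `model` is the loaded network): the fraction below is sparsity on paper only, since standard CPU kernels still multiply the zeroed values.

```python
# Count exact zeros across all parameters; dense zeroed weights are still
# processed by the same dense matmuls, so this does not translate into speed.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"Nominal sparsity: {zeros / total:.2%}")
```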


:three: 4-bit quantization on CPU

  • Attempting 4-bit quantization fails on CPU.

  • The bitsandbytes library is GPU-optimized (CUDA kernels), so CPU-only systems aren’t supported (example below).
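
For context, the usual bitsandbytes 4-bit load looks roughly like this (sketch); on a CPU-only machine this path fails because the 4-bit kernels target CUDA:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map="auto",  # the 4-bit kernels expect a CUDA device; requires accelerate
)
```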


:four: INT8 quantization issues

  • INT8 quantization sometimes crashes when saving with save_pretrained().

  • It seems Transformers’ serialization does not fully support int8 tensors on CPU (a workaround sketch is below).
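
One CPU-friendly workaround I’ve seen suggested (sketch, not verified end to end): quantize the Linear layers dynamically with PyTorch and save the state dict with torch.save() instead of save_pretrained():

```python
import torch

# Dynamic INT8 quantization of the Linear layers, CPU-only. The result contains
# torch quantized modules, so it is saved with torch.save() rather than
# Transformers' save_pretrained(). `model` = the FP32 model on CPU (assumed).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "gemma-2-2b-it-int8-dynamic.pt")
```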


:five: Package / environment issues

  • Missing bitsandbytes → 4-bit quantization fails.

  • Missing sentencepiece → tokenizer fails.

  • Missing langchain.text_splitter → ingestion fails.
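
For completeness, the corresponding installs are along these lines (package names as of current releases; newer LangChain versions move the splitters into the separate langchain-text-splitters package):

```
pip install bitsandbytes sentencepiece langchain
```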


:six: Saving pruned + quantized model

  • The pruned + quantized model sometimes fails to save, or its size doubles.

:seven: GPU vs CPU differences

  • On CPU, I can’t benefit from 4-bit quantization or any speed-up.

  • The optimized kernels that deliver the memory and inference improvements are GPU-only.


Questions:

  1. Is there a recommended way to prune and quantize models on CPU without increasing size?

  2. How do people typically handle saving pruned + quantized models?

  3. Any tips to get speed/memory benefits on CPU?

  4. Are there alternative approaches for CPU-only systems to reduce memory while maintaining accuracy?

Thanks in advance for any guidance!

1 Like

While the pruning problems themselves (the size increase and saving failures) can be avoided, the performance degradation caused by pruning is significantly greater than that from quantization. Recovering performance requires extensive fine-tuning on a GPU after pruning, making it a very difficult path. It is best avoided except for research or specialized purposes.

When aiming for memory savings and speed on CPUs, simply using a smaller LLM and quantization methods & backends optimized for CPUs yields faster speeds and higher accuracy.
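
For example, a GGUF build of a small instruct model run through llama.cpp (here via the llama-cpp-python bindings) is a common CPU path. A rough sketch, with the file name and thread count as placeholders:

```python
from llama_cpp import Llama

# Sketch: load a 4-bit GGUF quantization of a small instruct model and run it
# entirely on CPU. The GGUF file name and n_threads value are placeholders.
llm = Llama(model_path="gemma-2-2b-it-Q4_K_M.gguf", n_ctx=2048, n_threads=8)
out = llm("Explain pruning in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```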

1 Like

:one: Pruning Questions

Hi community,
I am pruning the Gemma 2B model using PyTorch’s nn.utils.prune. After pruning ~20–30% of the weights, I notice some performance degradation.

  • What are the best practices to minimize accuracy loss during pruning?

  • Is structured pruning recommended over unstructured pruning for LLMs like Gemma? (a structured-pruning sketch follows this list)

  • After pruning, is LoRA fine-tuning sufficient to recover performance?
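
On the structured vs. unstructured point, a minimal structured-pruning sketch; the module path inside Gemma is an assumption, so check the model’s named_modules() first:

```python
import torch
import torch.nn.utils.prune as prune

# Structured pruning zeroes out entire output rows chosen by L2 norm instead of
# scattering individual zeros. Note the tensor shape is unchanged; rows are
# zeroed, not removed. The layer path below is illustrative, not verified.
layer = model.model.layers[0].mlp.gate_proj
prune.ln_structured(layer, name="weight", amount=0.2, n=2, dim=0)
prune.remove(layer, "weight")  # make it permanent, as with unstructured pruning
```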


Title: Pruning + Fine-Tuning Workflow for Gemma 2B
Body:
Hello,
I want to prune Gemma 2B and then fine-tune it on a small dataset.

  • Should I prune first and then fine-tune with full parameters, or is LoRA/PEFT tuning enough? (a LoRA sketch follows this list)

  • Any advice on learning rate, batch size, or number of epochs for fine-tuning after pruning?

  • Has anyone successfully pruned + fine-tuned Gemma 2B? Any tips?
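
A rough sketch of the LoRA setup I have in mind with PEFT (rank, alpha, and target modules are placeholders, not recommendations):

```python
from peft import LoraConfig, get_peft_model

# Placeholder LoRA config; values and target modules would need tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` = the pruned base model (assumed)
model.print_trainable_parameters()
```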


:two: Quantization Questions

Title: CPU Inference for Gemma 2B After Pruning + Fine-Tuning
Body:
Hi,
I plan to perform pruning + LoRA fine-tuning + quantization for CPU inference on Gemma 2B.

  • Which quantization method works best for CPU: 8-bit or 4-bit?

  • Can I combine pruning + LoRA fine-tuning + 4-bit quantization without losing significant accuracy?

  • Does BitsAndBytes fully support CPU-only quantization for a 2B parameter model?


:three: Hardware / Workflow Questions

Title: Minimum GPU Requirements for Gemma 2B Workflow
Body:
Hello Hugging Face community,
I am planning a workflow for Gemma 2B:

  1. Download the model

  2. Prune ~20–30% weights

  3. Fine-tune with LoRA

  4. Quantize for CPU inference

  • What is the minimum GPU VRAM required for this workflow?

  • Can pruning be done entirely on CPU, or is GPU strongly recommended?

  • Are there any example scripts for pruning + LoRA fine-tuning + quantization for 2B+ LLMs?


:light_bulb: Tips before posting:

  • Include your environment (PyTorch version, GPU/CPU, RAM)

  • Include code snippet if possible, e.g., how you’re pruning or fine-tuning

1 Like