Issues with Pruning and Quantization of Hugging Face LLMs on CPU

Hi everyone,

I’m experimenting with pruning and quantizing LLMs (like unsloth/gemma-2-2b-it) and running into several issues. I’m hoping to get advice on best practices. Here’s what I’ve observed:


:one: Model size increases after pruning

  • After structured pruning (~30% of weights), the model roughly doubled in size on disk instead of shrinking.

  • I suspect this is because torch.nn.utils.prune keeps both the original weights (weight_orig) and a binary mask (weight_mask) on each pruned module, so the checkpoint stores both. Rough sketch below.
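
A minimal sketch of what I think is happening (using a plain nn.Linear as a stand-in so it runs anywhere; in my real script the same loop runs over the model’s Linear modules):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in layer; assumption: the real model's Linear layers behave the same way.
layer = nn.Linear(4096, 4096)

# Structured pruning of ~30% of output rows by L2 norm.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# At this point the module carries BOTH the original weights and a mask,
# so the saved state dict is larger than before pruning.
print(sorted(dict(layer.named_buffers()).keys()))     # ['weight_mask']
print(sorted(dict(layer.named_parameters()).keys()))  # ['bias', 'weight_orig']

# Making the pruning permanent drops weight_orig/weight_mask and leaves
# a single (still dense) 'weight' tensor.
prune.remove(layer, "weight")
print(sorted(dict(layer.named_parameters()).keys()))  # ['bias', 'weight']
```

Even after prune.remove() the weight is still a dense tensor, so as far as I can tell the file stops doubling but doesn’t actually shrink.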


:two: Accuracy and inference time unchanged

  • After pruning, accuracy and response time are almost identical to the baseline.

  • Only a fraction of the weights were zeroed, and those zeros are still stored and multiplied as dense tensors, so CPU inference doesn’t get faster automatically. Quick check below.
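
This is how I convinced myself the zeros are there but the dense matmul cost is unchanged (self-contained sketch; layer sizes and iteration counts are just illustrative):

```python
import time
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

@torch.no_grad()
def bench(module, x, iters=50):
    # Warm-up, then average wall-clock time of a dense CPU forward pass.
    for _ in range(5):
        module(x)
    start = time.perf_counter()
    for _ in range(iters):
        module(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 4096)

dense = nn.Linear(4096, 4096)
pruned = nn.Linear(4096, 4096)
prune.ln_structured(pruned, name="weight", amount=0.3, n=2, dim=0)
prune.remove(pruned, "weight")  # fold the mask in; weights stay dense, just with zeros

zero_frac = (pruned.weight == 0).float().mean().item()
print(f"zero weights: {zero_frac:.2%}")            # ~30%
print(f"dense : {bench(dense, x)  * 1e3:.2f} ms")
print(f"pruned: {bench(pruned, x) * 1e3:.2f} ms")  # ~same as dense
```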


:three: 4-bit quantization on CPU

  • Attempting 4-bit quantization fails on my CPU-only machine (repro sketch below).

  • The bitsandbytes kernels are GPU-optimized, so CPU-only systems don’t seem to be supported.
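
For reference, this is roughly the 4-bit load I’m attempting (model id and config values are just what I happen to use; on a machine without a CUDA GPU the load errors out):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "unsloth/gemma-2-2b-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# On a CPU-only box this fails, as far as I can tell because the
# bitsandbytes 4-bit kernels require CUDA.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="cpu",
)
```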


:four: INT8 quantization issues

  • INT8 quantization sometimes crashes when saving with save_pretrained() (sketch below).

  • It looks like Transformers’ serialization does not fully support the packed int8 tensors produced on CPU.
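
The INT8 path I’m using is PyTorch dynamic quantization. A sketch of what I run and the workaround I’ve fallen back to (output file names are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "unsloth/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# Dynamic INT8 quantization of the Linear layers (CPU-oriented API).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# This is where I hit the serialization error: the quantized Linear modules
# carry packed int8 params that save_pretrained() doesn't handle for me.
# quantized.save_pretrained("gemma-2-2b-it-int8")

# Workaround I'm using for now: save the raw state dict instead.
torch.save(quantized.state_dict(), "gemma-2-2b-it-int8.pt")
```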


:five: Package / environment issues

  • Missing bitsandbytes → 4-bit quantization fails.

  • Missing sentencepiece → tokenizer fails.

  • Missing langchain.text_splitter → ingestion fails.


:six: Saving pruned + quantized model

  • A model that has been both pruned and quantized sometimes fails to save, or the saved checkpoint comes out roughly double the original size (my current workaround is sketched below).
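
Putting the pieces together, this is the flow I’ve converged on: make the pruning permanent first so the mask buffers don’t inflate the checkpoint, then dynamically quantize, then save the state dict. A sketch under the same assumptions as above; I’d love to know if there’s a cleaner, officially supported path:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model_id = "unsloth/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# 1) Structured pruning of ~30% of rows in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        # 2) Fold the mask into the weight so weight_orig/weight_mask
        #    are dropped and the checkpoint doesn't double in size.
        prune.remove(module, "weight")

# 3) Dynamic INT8 quantization of the (now permanently pruned) Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# 4) Save the raw state dict; save_pretrained() still fails for me at this step.
torch.save(quantized.state_dict(), "gemma-2-2b-it-pruned-int8.pt")
```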

:seven: GPU vs CPU differences

  • On CPU I can’t get the memory savings or speed-up that 4-bit quantization provides on GPU.

  • The optimized kernels that deliver those memory and inference improvements appear to be GPU-only.


Questions:

  1. Is there a recommended way to prune and quantize models on CPU without increasing size?

  2. How do people typically handle saving pruned + quantized models?

  3. Any tips to get speed/memory benefits on CPU?

  4. Are there alternative approaches for CPU-only systems to reduce memory while maintaining accuracy?

Thanks in advance for any guidance!
