Hi everyone,
I’m experimenting with pruning and quantizing LLMs (like unsloth/gemma-2-2b-it) and running into several issues. I’m hoping to get advice or best practices. Here’s what I observed:
Model size increases after pruning
- After structured pruning (~30%), the model size roughly doubled instead of decreasing.
- I suspect this is due to the mask tensors PyTorch’s pruning utilities (torch.nn.utils.prune) attach during pruning, which keep both weight_orig and weight_mask around (see the sketch below).
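In case it helps, here is a minimal sketch of what I think is happening, assuming the pruning is done with torch.nn.utils.prune (the Linear layer is just a stand-in for an actual model module):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(2048, 2048)

# Structured pruning: zero out ~30% of output rows by L2 norm.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# The module now carries BOTH the original weights and a mask,
# which is why the checkpoint roughly doubles instead of shrinking.
print([name for name, _ in layer.named_parameters()])  # ['bias', 'weight_orig']
print([name for name, _ in layer.named_buffers()])     # ['weight_mask']

# Folding the mask into the weight drops weight_orig/weight_mask
# and brings the stored size back to normal.
prune.remove(layer, "weight")
print([name for name, _ in layer.named_parameters()])  # ['bias', 'weight']
```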
Accuracy and inference time unchanged
- After pruning, accuracy and response time stay almost identical.
- Only a portion of the weights is actually zeroed, and dense CPU kernels don’t skip zeros, so CPU inference doesn’t get faster automatically (quick sparsity check below).
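For what it’s worth, this is the quick check I’ve been using to see how sparse the model really is after pruning (sketch only; model stands for the loaded, pruned model):

```python
import torch

def global_sparsity(model: torch.nn.Module) -> float:
    """Fraction of parameters that are exactly zero."""
    total, zeros = 0, 0
    for p in model.parameters():
        total += p.numel()
        zeros += int((p == 0).sum())
    return zeros / total

# Even at ~30% sparsity, dense CPU matmuls still do the same amount of work,
# so latency (and accuracy) barely change.
# print(f"global sparsity: {global_sparsity(model):.1%}")
```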
4-bit quantization on CPU
- Attempting 4-bit quantization fails on CPU.
- The bitsandbytes library is GPU-optimized (CUDA kernels), so CPU-only systems aren’t supported (roughly what I tried is shown below).
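For reference, this is roughly the 4-bit loading path I tried (the standard transformers + bitsandbytes route); on a CPU-only machine it errors out because the 4-bit kernels need CUDA:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Fails without a CUDA device, since bitsandbytes' 4-bit kernels are GPU-only.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```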
INT8 quantization issues
- INT8 quantization sometimes crashes when saving with save_pretrained().
- It seems Transformers’ serialization does not fully support int8 tensors on CPU (workaround sketch below).
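The workaround I’m currently testing is PyTorch dynamic INT8 quantization plus a plain torch.save of the state dict, instead of save_pretrained() on the int8 model (sketch; it assumes dynamically quantizing only the Linear layers is acceptable):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("unsloth/gemma-2-2b-it")

# CPU-friendly INT8: weights quantized ahead of time, activations on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# save_pretrained() sometimes chokes on the quantized modules, so I save the
# state dict directly and re-apply quantize_dynamic before loading it back.
torch.save(quantized.state_dict(), "gemma2-2b-int8.pt")
```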
Package / environment issues
- Missing bitsandbytes → 4-bit quantization fails.
- Missing sentencepiece → the tokenizer fails.
- Missing langchain.text_splitter → ingestion fails.
Saving pruned + quantized model
- The pruned + quantized model sometimes fails to save, or its size on disk doubles (order-of-operations sketch below).
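In case the problem is just ordering, this is the sequence I’ve converged on (sketch, combining the pieces above): make the pruning permanent first so no masks or weight_orig copies are left, then quantize, then save.

```python
import torch
import torch.nn.utils.prune as prune

def finalize_pruning(model: torch.nn.Module) -> None:
    """Fold pruning masks into the weights of every pruned module."""
    for module in model.modules():
        if hasattr(module, "weight_mask"):  # marker left by torch.nn.utils.prune
            prune.remove(module, "weight")

# finalize_pruning(model)
# quantized = torch.quantization.quantize_dynamic(
#     model, {torch.nn.Linear}, dtype=torch.qint8
# )
# torch.save(quantized.state_dict(), "gemma2-2b-pruned-int8.pt")
```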
GPU vs CPU differences
- On CPU, I cannot benefit from 4-bit quantization or the associated speed-up.
- GPU-only optimized kernels seem to be needed for the memory and inference improvements.
Questions:
- Is there a recommended way to prune and quantize models on CPU without increasing size?
- How do people typically handle saving pruned + quantized models?
- Any tips to get speed/memory benefits on CPU?
- Are there alternative approaches for CPU-only systems to reduce memory while maintaining accuracy?
Thanks in advance for any guidance!