Sharing a project I’ve been working on: a compression scheme for LLM checkpoints based on GMM + ANS.
Motivation
Moving fine-tuned models around is expensive. A typical SFT checkpoint sits at 1.5–2 GB in safetensors format, and at GCS (inter-region) egress pricing ($0.12/GB), each download costs ~$0.20. At 1,000 downloads/month per model version that adds up fast — and optimizer states push a single training checkpoint past 5 GB. General-purpose compressors (gzip, zstd) barely work on float32 weights since they target byte-level patterns, and quantization methods change the computational graph. Gauss is designed for the archival/transfer case: compress once, recover the original dtype on demand.
What it does
Trained LLM weight tensors have strongly multi-modal value distributions — weights in a given layer cluster around a small number of modes. Gauss exploits this by fitting a K=16 Gaussian Mixture Model to each tensor, then entropy-coding the cluster assignments and residuals independently via ANS (Asymmetric Numeral Systems).
The pipeline per tensor:
- Hard EM to fit GMM (subsampled to 200k elements for large tensors)
- Assign every weight to its MAP cluster
- Quantize the residual:
r = clip(round((w - μ) × S), -32767, 32767) - ANS-encode two streams separately: index stream (Categorical) + residual stream (QuantizedGaussian)
Decompression is exact: ŵ = μ_cluster + r / S, cast back to original dtype.
Results
On a 24-layer SFT checkpoint (float32, 1645 MB):
1,645 MB → 335 MB (4.90×) max error ±5×10⁻⁴
Compression: 355s (2 workers, Colab CPU)
Decompression: 95s
On bf16 models, an adaptive scale cap kicks in (S drops from 1000 → 128) to avoid encoding sub-epsilon noise, giving ~3.3× in practice.
Per-layer breakdown: o_proj compresses best (5.3–6.5×), embedding tables worst (4.09×). The variance is meaningful and consistent with the interpretation that output projections have the most concentrated distributions.
Comparison
| Method | Ratio | Lossless? | Error |
|---|---|---|---|
| gzip / zstd | ~1.02–1.03× | Yes | -– |
| INT8 quantization | 4.00× | No | O(10⁻³) |
| Gauss | 4.90× | No | ±5×10⁻⁴ |
| INT4 quantization | 8.00× | No | O(10⁻²) |
For bf16, the ±5×10⁻⁴ error bound is well within bf16’s own precision limit (±3.9×10⁻³), so functionally it doesn’t matter.
Limitations (honest)
- Lossy — not suitable if you need bit-exact reproduction
- Tensors processed independently, no cross-layer structure exploited
- ANS encoding within each shard is sequential
A note on gauss info output
The info command computes the original size from tensor shape metadata rather than reading the actual file bytes. This means the displayed ratio can differ from what compress reports — particularly for bf16/fp16 models, where the stored dtype is 2 bytes per element but a miscalculation would show 2× inflated ratios. If the numbers look surprisingly high in info, cross-check against the actual file sizes from compress.
Links
- GitHub: GitHub - skystarry-ai/gauss: GAUSS: Distribution-Aware Compression for Neural Network Weights · GitHub
pip install gauss-compress- Technical report: GAUSS: Distribution-Aware Compression for Neural Network Weights
Curious if anyone has tried similar approaches or has thoughts on the adaptive K direction.