I’ve been building a fine-tuning SaaS and wanted to share some ablation results from a stability adapter I developed called CRMA (Constrained Residual Mixing Adapter). Happy to get feedback from anyone working on PEFT or training stability.
What is CRMA?
CRMA is a low-rank adapter that runs alongside LoRA/QLoRA. It applies a Sinkhorn-constrained, doubly stochastic mixing matrix to the residual stream at each transformer block, which keeps gradient dynamics stable during fine-tuning. It is inspired by the mHC architecture (DeepSeek-AI, arXiv:2512.24880), which showed at 27B scale that unconstrained multi-stream mixing causes catastrophic divergence.
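For concreteness, the Sinkhorn projection at the core of the constraint can be sketched as alternating row/column normalization of exponentiated mixing logits. This is a minimal NumPy sketch, not the actual CRMA code; the function name and iteration count are mine:

```python
import numpy as np

def sinkhorn(log_mix, n_iters=50):
    """Project exp(log_mix) toward the doubly stochastic set by
    alternately normalizing rows and columns (Sinkhorn-Knopp)."""
    M = np.exp(log_mix)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

rng = np.random.default_rng(42)
P = sinkhorn(rng.normal(size=(4, 4)))
# rows and columns both sum to ~1, so P is (nearly) doubly stochastic
```

Because the result is (approximately) doubly stochastic, mixing residual streams with it can only average them, never amplify them, which is where the stability guarantee comes from.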
Key design choices:
- Rank-4 low-rank stream projections (keeps param count ~1M vs LoRA’s ~5M on Mistral-7B)
- PiSSA initialization (NeurIPS 2024, arXiv:2404.02948) for better early learning signal
- ZClip adaptive gradient clipping (arXiv:2504.02507)
- Near-identity initialization: CRMA doesn’t disturb pretrained weights at step 0
- Separate LR for structural (log_mix) vs gate (_log_alpha) parameters
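The last point (separate learning rates per parameter role) can be sketched as a helper that partitions named parameters into optimizer groups. The parameter-name suffixes come from the post; the grouping function and the specific learning-rate values are illustrative assumptions, not CRMA's actual numbers:

```python
def crma_param_groups(named_params, base_lr=2e-4, mix_lr=5e-5, gate_lr=1e-3):
    """Split parameters into per-role groups so structural (log_mix) and
    gate (_log_alpha) tensors can each use their own learning rate.
    LR defaults here are placeholders, not the post's values."""
    groups = {"mix": [], "gate": [], "base": []}
    for name, p in named_params:
        if name.endswith("log_mix"):
            groups["mix"].append(p)
        elif name.endswith("_log_alpha"):
            groups["gate"].append(p)
        else:
            groups["base"].append(p)
    return [
        {"params": groups["mix"], "lr": mix_lr},
        {"params": groups["gate"], "lr": gate_lr},
        {"params": groups["base"], "lr": base_lr},
    ]
```

The returned list is the shape `torch.optim.AdamW` accepts for per-group options, e.g. `AdamW(crma_param_groups(model.named_parameters()))`.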
Ablation Results
All runs use the same model, dataset (a 200-row Alpaca subset), seed (42), and hyperparameters; the only variable is CRMA on vs. off.
TinyLlama-1.1B:
| Metric | LoRA only | LoRA + CRMA | Delta |
|---|---|---|---|
| Final loss | 0.1658 | 0.1651 | -0.4% (noise floor) |
| Peak grad norm | 12.15 | 5.75 | -52.7% |
| Mean grad norm | 2.34 | 2.07 | -11.5% |
| Spectral norm | - | 1.000000 | guaranteed <= 1 |
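The "guaranteed <= 1" entry is a property of doubly stochastic matrices, not of this particular run: by Birkhoff's theorem such a matrix is a convex combination of permutation matrices, and its largest singular value is exactly 1. A quick NumPy check (the matrix here is my own toy example):

```python
import numpy as np

# Any convex combination of permutation matrices is doubly stochastic,
# and every doubly stochastic matrix has spectral norm (largest singular
# value) exactly 1: the all-ones vector achieves 1, and unit row/column
# sums cap sigma_max at 1.
I = np.eye(4)
M = 0.6 * I + 0.3 * np.roll(I, 1, axis=0) + 0.1 * np.roll(I, 2, axis=0)

sigma_max = np.linalg.svd(M, compute_uv=False)[0]
print(round(sigma_max, 6))  # 1.0
```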
Mistral-7B (the key result):
Plain LoRA hit a catastrophic gradient spike at step 43 (grad norm ~263). With CRMA, the same step stayed at ~3.0, a 98.9% reduction, and the run completed cleanly.
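For readers unfamiliar with ZClip, the idea is to track an EMA of the gradient norm and rescale only statistically anomalous spikes rather than using a fixed threshold. The sketch below is in the spirit of ZClip (arXiv:2504.02507) but simplifies it; the decay, threshold, and rescale rule are stand-ins, not the paper's exact algorithm:

```python
class ZClipLike:
    """EMA z-score gradient clipping, loosely following ZClip
    (arXiv:2504.02507). Constants and update rules are simplified."""

    def __init__(self, beta=0.97, z_thresh=2.5):
        self.beta, self.z_thresh = beta, z_thresh
        self.mean, self.var = None, 0.0

    def clip_coef(self, grad_norm):
        """Return the factor to multiply gradients by this step."""
        if self.mean is None:                 # warm-up on first observation
            self.mean = grad_norm
            return 1.0
        std = max(self.var ** 0.5, 1e-8)
        z = (grad_norm - self.mean) / std
        if z > self.z_thresh:                 # anomalous spike: rescale down
            target = self.mean + self.z_thresh * std
            coef, tracked = target / grad_norm, target
        else:
            coef, tracked = 1.0, grad_norm
        # update stats with the post-clip norm so spikes don't poison the EMA
        delta = tracked - self.mean
        self.mean += (1.0 - self.beta) * delta
        self.var = self.beta * self.var + (1.0 - self.beta) * delta * delta
        return coef
```

In a training loop you would compute the total grad norm, multiply all gradients by `clip_coef(total_norm)`, then call `optimizer.step()`.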
Before the CRMA init bug fix, Mistral runs showed ~20% worse loss. After fixing the read_weights initialization (it was silently compressing hidden states to 33% of their original magnitude across 32 layers), the loss gap dropped to ~1-2% on this tiny dataset.
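A toy illustration of why the init matters (my own sketch, not the actual bug or fix): a stochastic mixer initialized uniformly averages the streams and shrinks the residual's magnitude, while a near-identity initialization passes it through essentially unchanged at step 0. Function names and the scale constant are mine:

```python
import numpy as np

def row_stochastic(logits):
    M = np.exp(logits)
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
h = rng.normal(size=4)                      # stand-in for a hidden state

# uniform logits -> every output is the mean of the inputs: magnitude collapses
bad = row_stochastic(np.zeros((4, 4))) @ h

# large diagonal logits -> mixer ~ identity: magnitude preserved at step 0
good = row_stochastic(6.0 * np.eye(4)) @ h

print(np.linalg.norm(bad) / np.linalg.norm(h))   # well below 1
print(np.linalg.norm(good) / np.linalg.norm(h))  # ~1.0
```

Stacked over 32 blocks, even a mild per-block shrink like the "bad" case compounds quickly, which is consistent with the ~20% loss regression the bug caused.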
HF Space
Would genuinely value feedback on the approach — especially from anyone who has worked on gradient stability or Sinkhorn-based methods.