CRMA: A low-rank stability adapter for QLoRA fine-tuning — ablation results on TinyLlama, Gemma, Mistral

I’ve been building a fine-tuning SaaS and wanted to share some ablation results from a stability adapter I developed called CRMA (Constrained Residual Mixing Adapter). Happy to get feedback from anyone working on PEFT or training stability.

What is CRMA?

CRMA is a low-rank adapter that runs alongside LoRA/QLoRA. It applies a Sinkhorn-constrained, doubly stochastic mixing matrix to the residual stream at each transformer block, which keeps gradient dynamics stable during fine-tuning. It was inspired by the mHC architecture (DeepSeek-AI, arXiv:2512.24880), which showed at 27B scale that unconstrained multi-stream mixing causes catastrophic divergence.
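To make the constraint concrete, here is a minimal sketch of the Sinkhorn projection idea (hypothetical names, not the actual CRMA code): alternating row/column normalization in log space drives a logit matrix toward the doubly stochastic set, and a large positive diagonal in `log_mix` gives a near-identity mix at step 0.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def sinkhorn(logits, n_iters=20):
    """Alternate row/column normalization in log space; converges to an
    (approximately) doubly stochastic matrix: rows and columns sum to 1."""
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=1)  # normalize rows
        log_p = log_p - logsumexp(log_p, axis=0)  # normalize columns
    return np.exp(log_p)

# Near-identity init: large diagonal logits keep the mix close to I at step 0
log_mix = np.eye(4) * 8.0
M = sinkhorn(log_mix)
```

The gradient flows through the logits, so the optimizer can only move the mix *within* the doubly stochastic set rather than off it.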

Key design choices:

  • Rank-4 low-rank stream projections (~1M added params vs ~5M for LoRA on Mistral-7B)
  • PiSSA initialization (NeurIPS 2024, arXiv:2404.02948) for better early learning signal
  • ZClip adaptive gradient clipping (arXiv:2504.02507)
  • Near-identity initialization: CRMA doesn’t disturb pretrained weights at step 0
  • Separate LR for structural (log_mix) vs gate (_log_alpha) parameters
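The last point is just the standard optimizer param-group pattern. A toy sketch (the `CRMABlock` shapes and the LR values are invented for illustration; only the parameter names `log_mix` and `_log_alpha` come from my actual code):

```python
import torch

class CRMABlock(torch.nn.Module):
    """Toy stand-in for one CRMA block (shapes are illustrative only)."""
    def __init__(self, n_streams: int = 4):
        super().__init__()
        self.log_mix = torch.nn.Parameter(torch.eye(n_streams) * 8.0)  # structural
        self._log_alpha = torch.nn.Parameter(torch.zeros(n_streams))   # gate

blocks = torch.nn.ModuleList(CRMABlock() for _ in range(4))
structural = [p for n, p in blocks.named_parameters() if "log_mix" in n]
gates = [p for n, p in blocks.named_parameters() if "_log_alpha" in n]

# Two param groups -> two learning rates inside a single optimizer
opt = torch.optim.AdamW([
    {"params": structural, "lr": 1e-4},  # slow: structure should drift gently
    {"params": gates, "lr": 1e-3},       # faster: gates can adapt quickly
])
```

Keeping the structural LR an order of magnitude below the gate LR is what lets the mixing matrix stay near identity early in training.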

Ablation Results

All runs use the same model, dataset (200-row Alpaca subset), seed (42), and hyperparameters; the only variable is CRMA on vs. off.

TinyLlama-1.1B:

| Metric | LoRA only | LoRA + CRMA | Delta |
|---|---|---|---|
| Final loss | 0.1658 | 0.1651 | -0.4% (noise floor) |
| Peak grad norm | 12.15 | 5.75 | -52.7% |
| Mean grad norm | 2.34 | 2.07 | -11.5% |
| Spectral norm | - | 1.000000 | guaranteed <= 1 |
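The spectral-norm row isn't an empirical accident: by Birkhoff's theorem a doubly stochastic matrix is a convex combination of permutation matrices, each with spectral norm 1, so ||M||_2 <= 1, and it equals 1 exactly because the all-ones vector is fixed. A quick numerical check with a generic Sinkhorn loop (not the CRMA code):

```python
import numpy as np

rng = np.random.default_rng(42)
M = np.exp(rng.normal(size=(8, 8)))   # strictly positive random matrix
for _ in range(200):                  # Sinkhorn: alternate row/col normalization
    M /= M.sum(axis=1, keepdims=True)
    M /= M.sum(axis=0, keepdims=True)

spectral = np.linalg.norm(M, 2)       # largest singular value
```

This is why the "Spectral norm" cell reads 1.000000 for every CRMA run regardless of what the optimizer does to the logits.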

Mistral-7B (the key result):

Plain LoRA hit a catastrophic gradient spike at step 43 (grad norm ~263). With CRMA, the same step stayed at ~3.0, a 98.9% reduction, and the run completed cleanly.
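For context on why a spike like that is survivable: z-score-based adaptive clipping in the spirit of ZClip caps the gradient norm relative to its own recent history rather than at a fixed threshold. A rough sketch of the idea (my simplification, not the paper's exact algorithm):

```python
class ZScoreClipper:
    """Sketch of z-score-based adaptive clipping: cap the grad norm at
    mean + z * std of an exponential moving average of past norms."""
    def __init__(self, beta: float = 0.97, z_thresh: float = 2.5):
        self.beta, self.z = beta, z_thresh
        self.mean = None
        self.var = 0.0

    def clip(self, gnorm: float) -> float:
        if self.mean is None:          # first step: just record the norm
            self.mean = gnorm
            return gnorm
        std = self.var ** 0.5
        limit = self.mean + self.z * std if std > 0 else gnorm
        clipped = min(gnorm, limit)
        # update EMA statistics with the clipped value, not the raw spike,
        # so one outlier doesn't inflate the threshold for later steps
        self.mean = self.beta * self.mean + (1 - self.beta) * clipped
        self.var = self.beta * self.var + (1 - self.beta) * (clipped - self.mean) ** 2
        return clipped
```

In a real training loop you'd scale the gradients by `clipped / gnorm` before `optimizer.step()`; this sketch only computes the cap.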

Before a fix to the CRMA init bug, Mistral runs showed ~20% worse loss. After fixing the read_weights initialization (it was silently compressing hidden states to 33% of their original magnitude across all 32 layers), the loss gap dropped to ~1-2% on this tiny dataset.
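That bug class is cheap to test for at init: multiply a random hidden state by the mixing matrix and compare norms. A toy illustration with uniform averaging vs identity (the numbers won't match the 33% from my run, which came from the specific broken init):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 64))            # 16 streams x 64-dim hidden states

M_identity = np.eye(16)                  # near-identity init: passes h through
M_uniform = np.full((16, 16), 1 / 16)    # uniform averaging: shrinks magnitude

ratio_identity = np.linalg.norm(M_identity @ h) / np.linalg.norm(h)
ratio_uniform = np.linalg.norm(M_uniform @ h) / np.linalg.norm(h)
```

Both matrices are doubly stochastic, so the spectral-norm guarantee holds either way; only the near-identity init also preserves the residual magnitude, which is why it matters for matching plain LoRA's loss.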

HF Space

Would genuinely value feedback on the approach — especially from anyone who has worked on gradient stability or Sinkhorn-based methods.
