[Research/Discussion] Depth-agnostic stability for residual models (no extra norms, no tuning). Is this useful to you?

Hi all,

TL;DR: I’m exploring a training stability principle (Master’s research) that keeps very deep residual-style models (ResNets, Transformers) stable without extra normalization layers, without new hyperparams, and without knowing the depth in advance.
Early runs look promising, but I’d love practical feedback on utility and the fairest benchmarks.

Prior check (CPU): I validated the formulas/invariants with CPU-only tests written in Go, and saw stability at high depth on toy setups.

Preliminary GPU result: a 4-hour real training run (GPT-2 architecture, 96 layers, sequence length 256) stayed stable (no divergence, steady throughput). I couldn’t extend it further due to local hardware limits (RTX 4060).

Trade-off: ~10–20% more GPU time than my standard setup (and a DeepNorm baseline) because the prototype isn’t optimized yet; I expect <5% overhead with straightforward engineering.

Caveat: research prototype under validation; results may vary and the code has rough edges. I’m not sharing the implementation details yet while I finish math/repro checks.

Questions for the community:

  1. Would you use a plug-in stability rule (no new hyperparams) if it reliably prevents divergence—even with a modest speed cost?

  2. Which benchmark would you trust most: synthetic notebook, CIFAR-10 (CV), or a small NLP task?

  3. For you, what matters more here: speed, final accuracy, or reliability (no NaNs/divergence)?

Thanks for reading — your feedback will decide whether I publish the repro notebook and pursue this direction further.


Interesting direction. The depth-agnostic property is valuable precisely because most stability fixes (BatchNorm, LayerNorm variants, careful weight init) require knowing the depth or architecture in advance, which limits generalizability.

One thing your results surface that’s relevant to fine-tuning (not just pretraining): the instability at depth isn’t isolated to forward-pass signal propagation. The same compounding problem appears in the backward pass during LoRA/QLoRA fine-tuning — gradient norms accumulate across adapter update steps in a way that fixed max_grad_norm thresholds can’t fully manage, because the threshold is set before the run sees its own norm distribution.
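For reference, the fixed-threshold setup most fine-tuning stacks use looks something like the toy PyTorch sketch below (the tiny linear model and the threshold value of 1.0 are just illustrative; 1.0 is the usual default):

```python
import torch

# Toy stand-in for the model being fine-tuned.
model = torch.nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
max_grad_norm = 1.0  # chosen before training starts, independent of this run's norm distribution

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Clips whenever the global norm exceeds the fixed threshold, regardless of
# what "normal" norms look like for this particular run.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
opt.step()
```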

We ran into a reproducible case of this on Mistral-7B QLoRA: a gradient norm spike at step ~44 (gn=15.28 vs normal ~1.0) that appeared across every run with the same seed. The fix that worked was computing a rolling z-score over recent gradient norms and only clipping statistical outliers — essentially a depth-agnostic-equivalent idea but applied to the backward pass norm history rather than the forward-pass signal.
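Rough sketch of that rolling z-score idea, for concreteness (illustrative only, not the exact code we ran; the window size of 50, the warm-up of 10 steps, and the z-threshold of 3.0 are placeholder values, and the toy linear model stands in for the adapter parameters):

```python
from collections import deque
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

window = deque(maxlen=50)  # rolling history of recent (unclipped) gradient norms
z_threshold = 3.0          # only norms this many std-devs above the rolling mean get clipped

for step in range(200):
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()

    # Global gradient norm for this step (L2 norm over all parameter gradients).
    total_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    ).item()

    if len(window) >= 10:  # need some history before the statistics mean anything
        mean = sum(window) / len(window)
        std = (sum((n - mean) ** 2 for n in window) / len(window)) ** 0.5
        z = (total_norm - mean) / (std + 1e-8)
        if z > z_threshold:
            # Statistical outlier: clip back toward the recent mean instead of a fixed constant.
            torch.nn.utils.clip_grad_norm_(model.parameters(), mean)

    window.append(total_norm)
    opt.step()
```

The key design choice is that the clip target comes from the run's own recent norm history rather than a constant picked up front, so normal steps are never touched and only spikes like the step ~44 one get pulled back.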

Would be curious whether your stability principle has implications for gradient norm behavior during fine-tuning specifically, or if it’s primarily a pretraining/architecture-depth phenomenon. The 10–20% GPU overhead trade-off you mention might be acceptable for pretraining, but the calculus is different for short fine-tuning runs.

Free tool, if it’s useful for comparison data points: Crma Fine Tuner - a Hugging Face Space by Fourwheels2512
