Hi all
TL;DR: I’m exploring a training stability principle (Master’s research) that keeps very deep residual-style models (ResNets, Transformers) stable without extra normalization layers, without new hyperparams, and without knowing the depth in advance.
Early runs look promising, but I’d love practical feedback on utility and the fairest benchmarks.
Prior check (CPU): I validated the formulas/invariants with Go unit tests on toy setups and saw stability at high depth (a toy sketch of this kind of check is included after the caveat below).
Preliminary GPU result: a 4-hour real training run (GPT-2, 96 layers, seq-len 256) stayed stable (no divergence, steady throughput). I couldn't run longer because of local hardware limits (RTX 4060).
Trade-off: ~10–20% more GPU time than my standard setup (and a DeepNorm baseline) because the prototype isn’t optimized yet; I expect <5% overhead with straightforward engineering.
Caveat: research prototype under validation; results may vary and the code has rough edges. I’m not sharing the implementation details yet while I finish math/repro checks.
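To make "CPU check of formulas/invariants" concrete, here is a minimal, self-contained Go toy of the kind of experiment I mean: it tracks activation RMS through stacked random residual blocks and compares an unscaled residual branch against a published DeepNet-style 1/sqrt(2L) branch scaling. To be clear, that scaling is only a public stand-in used for illustration, not the rule described in this post, and all the settings (dimension, depth, seed) are made up for the sketch.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// rms returns the root-mean-square of a vector (a simple proxy for activation scale).
func rms(v []float64) float64 {
	s := 0.0
	for _, x := range v {
		s += x * x
	}
	return math.Sqrt(s / float64(len(v)))
}

// randomLinear builds a dim x dim matrix with i.i.d. Gaussian entries scaled so that,
// for a unit-RMS input, the output also has roughly unit RMS.
func randomLinear(rng *rand.Rand, dim int) [][]float64 {
	w := make([][]float64, dim)
	scale := 1.0 / math.Sqrt(float64(dim))
	for i := range w {
		w[i] = make([]float64, dim)
		for j := range w[i] {
			w[i][j] = rng.NormFloat64() * scale
		}
	}
	return w
}

// apply computes y = W x.
func apply(w [][]float64, x []float64) []float64 {
	y := make([]float64, len(x))
	for i := range w {
		for j, wij := range w[i] {
			y[i] += wij * x[j]
		}
	}
	return y
}

func main() {
	const dim, depth = 64, 96 // illustrative sizes only
	rng := rand.New(rand.NewSource(1))

	// Random unit-scale input.
	x := make([]float64, dim)
	for i := range x {
		x[i] = rng.NormFloat64()
	}

	plain := append([]float64(nil), x...)  // x <- x + f(x)
	scaled := append([]float64(nil), x...) // x <- x + alpha * f(x)
	branchScale := 1.0 / math.Sqrt(2*float64(depth)) // DeepNet-style alpha, stand-in only

	for l := 1; l <= depth; l++ {
		w := randomLinear(rng, dim)
		fp := apply(w, plain)
		fs := apply(w, scaled)
		for i := 0; i < dim; i++ {
			plain[i] += fp[i]
			scaled[i] += branchScale * fs[i]
		}
		if l%24 == 0 {
			fmt.Printf("layer %3d  plain RMS %12.4g  scaled RMS %8.4g\n",
				l, rms(plain), rms(scaled))
		}
	}
}
```

With these settings the unscaled RMS should grow roughly like sqrt(2) per block (blowing up over 96 layers), while the scaled branch stays near its initial value; asserting that kind of bound is what my CPU tests do, just with the actual rule instead of this stand-in.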
Questions for the community:
- Would you use a plug-in stability rule (no new hyperparams) if it reliably prevents divergence, even with a modest speed cost?
- Which benchmark would you trust most: a synthetic notebook, CIFAR-10 (CV), or a small NLP task?
- What matters most to you here: speed, final accuracy, or reliability (no NaNs/divergence)?
Thanks for reading — your feedback will decide whether I publish the repro notebook and pursue this direction further.