Hi everyone!
I just published a long-form blog post digging into why Muon often appears “unstable” or “slow” when people try to scale it to large models, and why this is not primarily an implementation or tuning issue.
TL;DR:
Muon does scale — but only if you respect the geometry changes that appear at scale. More GPUs, better kernels, or cleaner distributed code can’t compensate for the wrong math regime.
What the post covers
This is not a “how to use Muon” tutorial. It’s a failure-mode analysis based on reproductions and controlled simulations.
I focus on two emergent failure modes that do not show up in small MLPs:
- **Paralysis (Vanishing Updates)**
  - Newton–Schulz orthogonalization enforces unit singular values.
  - As matrix dimensions grow, per-parameter update RMS shrinks as $1/\sqrt{D}$.
  - No learning-rate schedule or hardware optimization can fix:
    $$\lim_{D \to \infty} \frac{1}{\sqrt{D}} = 0$$
  - The fix is a dimensional correction, not a heuristic.
- **Drift (Unbounded Weight Growth)**
  - Full-rank Muon updates behave like a high-dimensional random walk.
  - Without proper weight decay, weight norms inevitably drift until numerical failure.
  - From a statistical view, this is a unit-root problem; weight decay enforces stationarity.
  - At scale, weight decay is structural stabilization, not regularization.
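The $1/\sqrt{D}$ collapse is easy to check numerically. Here is a minimal sketch of my own (not the post's code), using exact SVD in place of the Newton–Schulz iteration — both target the same $UV^\top$ projection:

```python
import numpy as np

def orthogonalize(g):
    """Map G -> U V^T, i.e. set every singular value to 1.

    Exact SVD stands in for Muon's iterative Newton-Schulz
    orthogonalization; both converge to the same projection.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
for d in (64, 256, 1024):
    g = rng.standard_normal((d, d))
    update = orthogonalize(g)
    rms = float(np.sqrt(np.mean(update ** 2)))
    # A D x D matrix with all singular values equal to 1 has
    # Frobenius norm sqrt(D), so per-entry RMS = sqrt(D)/D = 1/sqrt(D).
    print(f"D={d:4d}  update RMS={rms:.4f}  1/sqrt(D)={1 / np.sqrt(d):.4f}")
```

The match is exact (up to float error) because the projected update is orthogonal by construction, which is why no scalar learning-rate schedule can undo the shrinkage.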
I also explicitly separate:

- Engineering scaling problems (communication, sharding, overlap)
- Geometric scaling problems (update energy, variance growth, numerical limits)
Fast divergence is still divergence.
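The drift failure mode can likewise be sketched in a few lines. This is my own toy model, not the post's code: random unit-RMS directions stand in for orthogonalized Muon steps, and weight decay is applied in the decoupled (AdamW-style) form:

```python
import numpy as np

def simulate(steps, d, lr, wd, seed=0):
    """Track ||w|| under random unit-RMS updates (a toy stand-in for
    orthogonalized Muon steps), with optional decoupled weight decay."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    norms = []
    for _ in range(steps):
        u = rng.standard_normal(d) / np.sqrt(d)  # unit-RMS update direction
        w = (1.0 - lr * wd) * w - lr * u         # decoupled decay, then step
        norms.append(float(np.linalg.norm(w)))
    return norms

no_decay = simulate(steps=5000, d=1024, lr=0.05, wd=0.0)
decay    = simulate(steps=5000, d=1024, lr=0.05, wd=0.1)
# Without decay the recursion has an AR(1) coefficient of exactly 1
# (a unit root), so ||w|| grows like lr * sqrt(t); with decay the
# coefficient drops below 1 and the norm settles at a stationary level.
print(f"no decay  : ||w|| at t=500 -> {no_decay[499]:.3f}, t=5000 -> {no_decay[-1]:.3f}")
print(f"with decay: ||w|| at t=500 -> {decay[499]:.3f}, t=5000 -> {decay[-1]:.3f}")
```

Even this crude model reproduces the qualitative split: the undecayed norm keeps climbing like $\sqrt{t}$, while any positive decay turns the recursion into a stationary AR(1) process.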
Why I wrote this
I used to think scaling meant:
more GPUs + bigger models + some hyperparameter tuning
After reproducing Muon’s “scalable” results from multiple angles, it became clear that scaling is a phase change — optimizers enter new regimes where small-model intuition breaks.
Muon is a particularly clean case study, but the lesson is general.
Links
Blog post: Scaling Is Not Plug-and-Play: What Muon Teaches Us About Optimizers at Scale
Code (simulations will be included shortly):
https://huggingface.co/datasets/bird-of-paradise/muon-distributed-reproducibility
Reference paper: Muon Is Scalable for LLM Training (Moonshot AI)
Discussion welcome
I’d especially love feedback or counterexamples from folks who:
- have run Muon (or similar full-rank optimizers) at scale,
- trained wide or long-context models where update RMS collapsed,
- hit norm drift or bf16 instability despite “reasonable” hyperparameters,
- or had to introduce explicit scaling / decay to keep runs numerically stable.
Happy to clarify assumptions, share reproduction details, or discuss where this analysis might break down in real runs.