Hi everyone!
I just published a full technical report where I reproduced MoonshotAI’s Distributed Muon optimizer, validated their communication efficiency claims, and profiled DP/TP configurations on a 4-GPU cluster.
The post includes:
• Full Muon DP=2/TP=2 and Adam profiling
• Perfetto traces (communication patterns)
• Memory analysis
• Two bug fixes to the open-source PoC
• Async-op experiments (and why naive overlap slows things down)
Key results:
• 0.57× communication compared to AdamW
• 1.1% optimizer overhead
• 50% less state memory
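For context on the state-memory number: AdamW keeps two moment buffers per parameter while Muon keeps a single momentum buffer, which is where a ~50% saving comes from. A toy accounting sketch (my illustration, assuming fp32 optimizer state; the exact dtypes profiled in the report may differ):

```python
# Toy optimizer-state accounting: bytes of optimizer state per parameter.
# Assumes fp32 (4-byte) state buffers; mixed-precision setups will differ.

BYTES_FP32 = 4

def adamw_state_bytes(n_params: int) -> int:
    # AdamW tracks exp_avg (first moment) and exp_avg_sq (second moment).
    return n_params * BYTES_FP32 * 2

def muon_state_bytes(n_params: int) -> int:
    # Muon tracks a single momentum buffer per parameter.
    return n_params * BYTES_FP32 * 1

n = 1_000_000_000  # 1B parameters, purely for illustration
print(adamw_state_bytes(n) / 2**30, "GiB of AdamW state")
print(muon_state_bytes(n) / 2**30, "GiB of Muon state")
```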
Write-up here: Reproducing and Validating Distributed Muon: A Practical Verification of Communication Efficiency Claims
I’m preparing a cleaned-up repo next. If you are experimenting with Muon, distributed optimizers, or multi-node scaling, happy to collaborate or cross-validate results.
1 Like
The optimizer overhead number made me grin because I have spent way too many nights fighting with bloated state updates. Seeing Muon behave this lean gives me a strange feeling of hope and mild jealousy at the same time.
2 Likes
Haha, the ‘bloat’ is the enemy of us all!
While I haven’t battled 36B models personally (yet!), seeing the math play out was definitely a ‘hopeful’ moment.
I know deciding whether to use Muon is tricky for fine-tuning (especially if you use LoRA), but if you ever want to mess around with the distributed setup or see the raw traces, I just pushed the full reproducibility suite: bird-of-paradise/muon-distributed-reproducibility · Datasets at Hugging Face.
Good luck with that 36B model—that sounds like a beast to wrangle!
1 Like
This is a solid piece of work — especially the Perfetto traces and the effort to reproduce claims end-to-end rather than benchmarking in isolation.
One thing I appreciate here is the separation between measured behavior (communication patterns, memory footprint, async overlap) and interpretation. That’s where a lot of optimizer discussions quietly break down.
Curious if you observed any regimes where the gains disappeared or inverted — e.g., different batch sizes, skewed DP/TP ratios, or failure cases where overlap assumptions stop holding. In my experience, documenting those boundaries is just as valuable as the headline improvements.
Either way, this kind of reproducibility work is exactly what the ecosystem needs more of. Thanks for sharing the full traces and fixes instead of just conclusions.
2 Likes
Thanks for the detailed feedback! You hit on exactly why I wanted to publish the traces: too much detail gets buried under high-level claims.
Regarding regimes where the gains disappear:
- The Async Trap: I actually found that naive overlap assumptions completely failed in my setup. Adding async_op=True without manual stream management and compute pipelining inverted the gains and made Muon ~2x slower (detailed in Section 6 of the report).
- Hardware Constraints: my specific setup (PCIe Gen4 x4) is a ‘worst-case’ regime for bandwidth, which highlighted Muon’s efficiency. I suspect that on a high-bandwidth NVLink cluster the gap might narrow, because AdamW’s communication wouldn’t be as punishing.
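The async trap is easy to see even in a toy model: launching an “async” op and then waiting on it immediately serializes communication and compute, whereas real overlap requires independent work between launch and wait. A hypothetical timing sketch (pure Python threads standing in for the collective and the compute; this is not the PoC code, and it only shows the best case for overlap — it can’t capture the stream-contention slowdown described in Section 6):

```python
import threading
import time

COMM_S = 0.2     # stand-in for an all-gather's latency
COMPUTE_S = 0.2  # stand-in for independent compute (e.g. other params)

def comm():
    time.sleep(COMM_S)      # pretend collective

def compute():
    time.sleep(COMPUTE_S)   # pretend independent compute

def naive_async() -> float:
    # Launch "async" comm, then wait immediately: nothing is hidden.
    t = threading.Thread(target=comm)
    start = time.perf_counter()
    t.start()
    t.join()                # like calling handle.wait() right after launch
    compute()
    return time.perf_counter() - start

def overlapped() -> float:
    # Launch comm, run independent compute, then wait: comm is hidden.
    t = threading.Thread(target=comm)
    start = time.perf_counter()
    t.start()
    compute()
    t.join()
    return time.perf_counter() - start

print(f"naive: {naive_async():.2f}s, overlapped: {overlapped():.2f}s")
```

The naive path takes roughly COMM_S + COMPUTE_S while the overlapped path takes roughly max(COMM_S, COMPUTE_S); on real hardware the picture is murkier because the “independent” compute and the collective can contend for the same streams and PCIe lanes.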
I’d love to see if the DP=TP sweet spot holds up on 32+ GPUs, which is next on my list!
1 Like
This is extremely helpful context — thank you for taking the time to spell out the failure regimes explicitly.
The async trap you describe resonates a lot. I’ve seen similar inversions where “theoretical overlap” collapses once stream semantics and scheduling realities are introduced, especially when the framework abstracts away too much control. The fact that async_op=True without explicit stream management made things worse is a really important data point that often gets lost in high-level optimizer discussions.
The PCIe Gen4 x4 angle is also interesting — it makes your setup a great stress test for communication-heavy optimizers. I agree that on high-bandwidth NVLink systems the gap may narrow, but in practice a lot of real-world deployments still live closer to your regime than the idealized one.
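To make that regime concrete, here is a back-of-envelope ring all-reduce estimate. The numbers are my own assumptions, not from the report: ~7 GB/s effective PCIe Gen4 x4 bandwidth and fp32 gradients.

```python
# Back-of-envelope: per-step gradient all-reduce time on a ring topology.
# Assumed numbers (not from the report): ~7 GB/s effective PCIe Gen4 x4.

def ring_allreduce_seconds(param_count: int, n_gpus: int,
                           bandwidth_gb_s: float = 7.0,
                           bytes_per_param: int = 4) -> float:
    # A ring all-reduce sends/receives 2*(n-1)/n of the buffer per GPU.
    volume_gb = param_count * bytes_per_param / 1e9
    return 2 * (n_gpus - 1) / n_gpus * volume_gb / bandwidth_gb_s

# e.g. a 1B-parameter model's fp32 gradients across 4 GPUs
print(f"{ring_allreduce_seconds(1_000_000_000, 4):.2f} s per all-reduce")
```

Even this crude model shows why communication-heavy optimizer steps hurt so much at single-digit GB/s, and why a 0.57x communication factor is worth real wall-clock time in this regime.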
Documenting where assumptions break is exactly what gives this work long-term value. Looking forward to seeing how the DP=TP balance behaves at larger scales — especially whether the “sweet spot” shifts or simply flattens.
2 Likes