Hey everyone! 
I’m excited to share a comprehensive tutorial I’ve created on understanding and implementing the Muon optimizer - a recent innovation that’s showing impressive performance improvements over traditional optimizers like AdamW and SGD.
What is Muon?
Muon (MomentUm Orthogonalized by Newton-Schulz) was introduced by Keller Jordan in October 2024 and has quickly gained attention in the optimization community. It specifically targets matrix parameters in neural networks, using Newton-Schulz iterations to orthogonalize gradient updates.
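For readers who want to see the core idea in code, here is a minimal sketch of that orthogonalization step, assuming plain float32 on CPU. The quintic coefficients follow the values published with Keller Jordan’s reference implementation; the function name is illustrative, and details such as bfloat16 compute and momentum handling are omitted.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate U @ V^T (the orthogonal factor of G's SVD) with a quintic
    Newton-Schulz iteration. Coefficients follow Keller Jordan's published code;
    everything else here is a simplified, illustrative sketch."""
    assert G.ndim == 2, "Muon orthogonalizes 2-D (matrix) parameters only"
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                        # work in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

# Usage: replace a raw (momentum-averaged) gradient with its orthogonalized form.
grad = torch.randn(256, 128)
update = newton_schulz_orthogonalize(grad)
print(torch.linalg.svdvals(update)[:5])   # singular values are pushed toward ~1
```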
What does this tutorial cover?
- The Problem: Why traditional optimizers struggle with skewed singular value distributions (a small numerical illustration follows this list)
- The Solution: How Muon’s matrix orthogonalization addresses this fundamental issue
- Practical Implementation: A clean, educational implementation in PyTorch
- Performance Analysis: Experimental results showing Muon’s benefits
- Lessons Learned: Practical insights from implementing and using Muon
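As promised above, here is a small illustrative snippet (not taken from the tutorial) that makes the Problem/Solution bullets concrete: a gradient with a skewed singular value spectrum mostly updates a few directions, while its orthogonalized counterpart, the exact operation that Newton-Schulz approximates, spreads the update evenly across the parameter space.

```python
import torch

# Build a synthetic "gradient" whose singular values span four orders of
# magnitude: a few directions dominate the update, which is the skew Muon targets.
U = torch.linalg.qr(torch.randn(128, 64)).Q
V = torch.linalg.qr(torch.randn(64, 64)).Q
grad = U @ torch.diag(torch.logspace(0, -4, 64)) @ V.T

# Exact orthogonalization (what Newton-Schulz approximates cheaply):
# replace all singular values with 1, i.e. grad -> U @ V^T.
Us, _, Vh = torch.linalg.svd(grad, full_matrices=False)
ortho_update = Us @ Vh

print(torch.linalg.svdvals(grad)[[0, 31, 63]])          # skewed: ~1, ~1e-2, ~1e-4
print(torch.linalg.svdvals(ortho_update)[[0, 31, 63]])  # flat: all ~1
```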
Key Findings
In my experiments, Muon significantly outperformed traditional optimizers:
- On MNIST, Muon achieved 34% lower loss than AdamW after just 3 epochs
- On CIFAR-10, Muon reached 80.79% accuracy vs. AdamW’s 71.66% after 5 epochs
- All this with minimal computational overhead on modern hardware
Why I created this
While exploring Muon, I found there was a gap between the mathematical description in research papers and practical implementation details. This tutorial aims to bridge that gap, providing both theoretical understanding and a working implementation.
I was particularly struck by how a relatively simple mathematical insight - orthogonalizing gradient updates to better utilize the full parameter space - could lead to such significant performance improvements.
Check it out!
Muon Tutorial on Hugging Face
The repository includes a full README explanation and a Colab notebook where you can run all the experiments yourself.
I’d love to hear your thoughts, questions, or experiences if you try Muon in your own projects!
Update & New Advanced Notebook!
Hey everyone, wanted to share a significant update to this tutorial for those interested in applying Muon to large-scale, distributed systems.
I’ve added a new, standalone notebook, MuonForOLMo.ipynb. This implementation is FSDP-compatible and is adapted from my pending PR to AI2’s OLMo repository.
Key features in the new notebook:
- Distributed Training Ready: Full FSDP compatibility for multi-GPU setups.
- Hybrid MuonW Optimizer: A robust implementation that uses Muon for matrix parameters and AdamW as a fallback for everything else (e.g., embeddings, biases); a minimal parameter-routing sketch follows this list.
- Advanced Metric Tracking: Includes a new method for detailed monitoring of the optimizer’s state during training.
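To give a sense of what that hybrid routing looks like, here is a hedged sketch of splitting parameters between the two optimizers. It is not the notebook’s actual code: the toy module, the name-based rules, and the commented-out MuonOptimizer placeholder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A toy module just to have realistically named parameters."""
    def __init__(self, vocab: int = 1000, d: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.proj = nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)
        self.lm_head = nn.Linear(d, vocab, bias=False)

def build_hybrid_param_groups(model: nn.Module):
    """Route 2-D weight matrices to Muon and everything else (biases, norms,
    embeddings, output head) to the AdamW fallback. Illustrative only."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_matrix = p.ndim == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        (muon_params if is_matrix and not is_embedding_or_head else adamw_params).append(p)
    return muon_params, adamw_params

muon_params, adamw_params = build_hybrid_param_groups(TinyLM())
adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.1)
# muon = MuonOptimizer(muon_params, lr=0.02, momentum=0.95)  # hypothetical class
# A training step would then call both: muon.step(); adamw.step()
```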
The goal is to bridge the gap from the original educational implementation to a more practical, production-ready example.
You can find the new notebook in the “Advanced Implementation” section of the main tutorial page.
Looking forward to any feedback!
Sequel Drop — “The Muon is Scalable” (CPU-Friendly Edition)
Following the momentum of my original tutorial, Understanding the Muon Optimizer (1,300+ downloads in its first two months 🎉), I’ve just released its long-awaited sequel:
bird-of-paradise/muon_distributed
This new reverse-engineering breakdown (CPU-friendly, tutorial-style) is the expert-level, systems-engineering companion to the first one: a full, annotated rewrite of Moonshot AI’s “Muon is Scalable for LLM Training” proof-of-concept, adapted to run on plain CPU/Gloo.
Highlights 
• Runs anywhere – no GPU needed (great for broke-but-curious builders)
• Demonstrates end-to-end DP × TP orchestration with ZeRO-1 sharding
• Shows the full (DP gather → TP gather) → Run Math → (TP shard → DP shard) flow, sketched in the runnable CPU snippet after this list
• Includes fixes and readability improvements over the Moonshot PoC
• Companion to my Medium series “The Turtle Speed Breakthrough” (Parts 1-3)
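As referenced in the flow bullet above, here is a minimal, CPU/Gloo-runnable sketch of that (DP gather → TP gather) → Run Math → (TP shard → DP shard) choreography. It is illustrative rather than a copy of the repository’s code: the 2 × 2 DP/TP grid, the toy matrix shapes, and the SVD stand-in for Newton-Schulz are all assumptions for the demo.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

DP, TP = 2, 2                     # 2-way data parallel x 2-way tensor parallel
ROWS, COLS = 16, 8                # full (toy) weight matrix

def run(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    dp_rank, tp_rank = rank // TP, rank % TP
    # Every rank must create every group, in the same order.
    tp_groups = [dist.new_group([d * TP + t for t in range(TP)]) for d in range(DP)]
    dp_groups = [dist.new_group([d * TP + t for d in range(DP)]) for t in range(TP)]
    tp_group, dp_group = tp_groups[dp_rank], dp_groups[tp_rank]

    # Each rank owns a (ROWS / TP / DP, COLS) shard of the "gradient": TP splits
    # the rows first, then ZeRO-1-style DP splits each TP slice again. The full
    # matrix is seeded identically here only so every rank can carve out its shard.
    torch.manual_seed(0)
    full = torch.randn(ROWS, COLS)
    tp_slice = full.chunk(TP, dim=0)[tp_rank]
    local_shard = tp_slice.chunk(DP, dim=0)[dp_rank].clone()

    # --- (DP gather -> TP gather) ---
    dp_parts = [torch.empty_like(local_shard) for _ in range(DP)]
    dist.all_gather(dp_parts, local_shard, group=dp_group)
    my_tp_slice = torch.cat(dp_parts, dim=0)             # full TP slice, (ROWS/TP, COLS)

    tp_parts = [torch.empty_like(my_tp_slice) for _ in range(TP)]
    dist.all_gather(tp_parts, my_tp_slice, group=tp_group)
    full_grad = torch.cat(tp_parts, dim=0)               # full matrix, (ROWS, COLS)

    # --- Run Math (SVD stand-in for Muon's Newton-Schulz orthogonalization) ---
    U, _, Vh = torch.linalg.svd(full_grad, full_matrices=False)
    full_update = U @ Vh

    # --- (TP shard -> DP shard): keep only what this rank owns ---
    new_tp_slice = full_update.chunk(TP, dim=0)[tp_rank]
    new_local = new_tp_slice.chunk(DP, dim=0)[dp_rank]
    print(f"rank {rank} (dp={dp_rank}, tp={tp_rank}) shard shape {tuple(new_local.shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(DP * TP,), nprocs=DP * TP, join=True)
```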
This CPU version validates the logic, symmetry, and sharding choreography of Muon’s distributed backbone — the blueprint behind scalability.
Next up: testing coalesced all_gather for true multi-GPU scaling (targeting 8 GPUs).
If you have spare compute, or just want to join the distributed chaos, hop in the study group Discord (link in the main thread).
Because sometimes the best way to learn distributed nightmares is to get your hands dirty and your eyes crossed.
#Muon #DistributedComputing #PyTorch #ZeRO #TensorParallelism #AIResearch #DeepLearning #HuggingFace #OpenSource #Tutorial #MachineLearning