Study Group: Implementing a Scalable, FSDP-Compatible Muon Optimizer

Hi everyone,

I’ve recently been diving deep into the OLMo codebase and submitted a PR to add a DDP-compatible version of the Muon optimizer (with an AdamW backup). The FSDP-compatible version I’m working on is logically correct but introduces significant communication overhead. That work led me to the fascinating Moonshot AI paper, “Muon is Scalable for LLM Training,” where they successfully trained models with over 1T parameters using this optimizer.

Here’s the core challenge: Muon’s update is not element-wise. It orthogonalizes each full weight matrix (via Newton-Schulz iterations), so sharded data-parallel strategies like FSDP have to reassemble the full gradient/momentum matrices before the update, and that reassembly is where the communication overhead comes from. The Moonshot paper shows that a communication-efficient, scalable implementation exists, likely relying on techniques like coalesced, non-blocking communication collectives.
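
To make the bottleneck concrete, here is a minimal, naive sketch (not the PR code). The Newton-Schulz coefficients follow the public Muon implementations; `sharded_muon_step_sketch`, its arguments, and the even row-sharding are hypothetical assumptions on my part. The blocking all-gather right before the matrix math is exactly the communication overhead in question.

```python
import torch
import torch.distributed as dist


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G.
    # Coefficients follow public Muon implementations; the whole matrix is
    # needed at once -- this is the non-element-wise part.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)


def sharded_muon_step_sketch(param_shard, grad_shard, momentum_shard,
                             lr=0.02, beta=0.95, group=None):
    # Hypothetical per-parameter step. Assumes each rank holds a contiguous,
    # equal-sized row-shard of one 2D parameter, its gradient, and its momentum.
    # (Muon's usual shape-based scale factor is omitted to keep this short.)
    momentum_shard.mul_(beta).add_(grad_shard)

    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    gathered = [torch.empty_like(momentum_shard) for _ in range(world_size)]
    dist.all_gather(gathered, momentum_shard, group=group)  # blocking: the overhead lives here
    full_momentum = torch.cat(gathered, dim=0)               # reassemble the full matrix

    update = newton_schulz_orthogonalize(full_momentum)
    rows = momentum_shard.size(0)
    param_shard.add_(update[rank * rows:(rank + 1) * rows], alpha=-lr)
```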

I’m looking to form a small study group to:

  1. Deep-dive into the Moonshot paper to understand their distributed implementation.

  2. Explore how to implement these techniques in PyTorch (see the rough sketch after this list for the kind of non-blocking collective pattern involved).

  3. Collaborate on a proof-of-concept, FSDP-compatible Muon optimizer that is both logically correct and communication-efficient.
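
On point 2, here is the kind of non-blocking, overlapped collective pattern I have in mind, as a rough sketch rather than anything taken from the paper. The function name, the even padding-free flat sharding, and the `orthogonalize` callable are all my assumptions; a real implementation would also need to coalesce many small matrices into fewer, larger gathers.

```python
import torch
import torch.distributed as dist


def overlapped_gather_orthogonalize(shards, full_shapes, orthogonalize, group=None):
    # Hypothetical comm/compute overlap sketch. Assumes each 2D parameter is
    # evenly sharded across ranks with no padding, so concatenating the
    # gathered chunks and reshaping recovers the full matrix.
    world_size = dist.get_world_size(group)

    # 1) Launch every all-gather up front as a non-blocking collective.
    pending = []
    for shard, shape in zip(shards, full_shapes):
        flat = shard.contiguous().view(-1)
        out = torch.empty(world_size * flat.numel(), dtype=flat.dtype, device=flat.device)
        work = dist.all_gather_into_tensor(out, flat, group=group, async_op=True)
        pending.append((out, shape, work))

    # 2) Consume results in launch order: while one matrix is being
    #    orthogonalized, the gathers for the later ones keep progressing.
    updates = []
    for out, shape, work in pending:
        work.wait()                        # waits only on this parameter's gather
        updates.append(orthogonalize(out.view(shape)))
    return updates
```

Bucketing small matrices into fewer gathers and deciding which rank runs which orthogonalization are exactly the details I’d like the group to work out together.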

This is a great opportunity for anyone interested in the intersection of optimizer design and large-scale distributed systems. If you’d like to join in on reverse-engineering this and building a cool open-source tool, please reply here or reach out!

Looking forward to collaborating. :handshake:
