Study Group: Implementing a Scalable, FSDP-Compatible Muon Optimizer

Hi everyone,

I’ve recently been diving deep into the OLMo codebase and submitted a PR to add a DDP-compatible version of the Muon optimizer (with an AdamW backup). The FSDP-compatible version I’m working on is logically correct but introduces significant communication overhead. That work led me to the fascinating Moonshot AI paper, “Muon is Scalable for LLM Training,” where they successfully trained models with over 1T parameters using this optimizer.

This raises a key challenge: the standard Muon implementation requires a non-element-wise operation (matrix orthogonalization via Newton-Schulz iteration), which needs the full gradient matrix and therefore clashes with sharded data-parallel strategies like FSDP, where each rank only holds a slice of it. The Moonshot paper shows that a communication-efficient, scalable solution exists, likely built on techniques such as coalesced, non-blocking communication collectives.
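
For anyone new to Muon, here is a minimal sketch of that orthogonalization step, using the quintic Newton-Schulz coefficients from the commonly used open-source implementation (the coefficients, step count, and function name here are illustrative, not prescriptive). It also makes clear why the step can’t be done shard-by-shard: the iteration needs the whole matrix.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D gradient matrix via Newton-Schulz iteration.

    This is the non-element-wise step that makes Muon awkward under FSDP:
    it operates on the *full* matrix, not a local shard.
    Coefficients follow the widely used quintic iteration from the
    open-source Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    # Work in the "wide" orientation so X @ X.mT is the smaller Gram matrix.
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT
    # Normalize so the spectral norm is <= 1 (needed for the iteration to converge).
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```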

I’m looking to form a small study group to:

  1. Deep-dive into the Moonshot paper to understand their distributed implementation.

  2. Explore how to implement these techniques in PyTorch.

  3. Collaborate on a proof-of-concept, FSDP-compatible Muon optimizer that is both logically correct and communication-efficient (a rough sketch of the kind of communication pattern involved follows below this list).
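
To make item 3 concrete, here is a rough, hedged sketch of one way the communication could be organized in PyTorch: launch non-blocking all-gathers for every sharded gradient up front, then overlap the Newton-Schulz math with the in-flight collectives. The helper name `distributed_orthogonalized_updates`, the ZeRO-style flat-shard assumption, and the reuse of `newton_schulz_orthogonalize` from the earlier sketch are all mine; this is a starting point for discussion, not the scheme from the paper.

```python
import torch
import torch.distributed as dist

def distributed_orthogonalized_updates(sharded_grads, full_shapes, group=None):
    """Hedged sketch: gather each parameter's flat gradient shards into a full
    matrix and orthogonalize it, overlapping communication with compute.

    Assumptions (mine, not the paper's): every parameter's gradient is split
    into equal flat shards across ranks (ZeRO-style), `full_shapes` holds the
    original 2-D shapes, and `newton_schulz_orthogonalize` is the function
    from the earlier sketch in this thread.
    """
    world_size = dist.get_world_size(group)
    handles, gathered = [], []
    # 1) Launch all all-gathers up front as non-blocking ops so they can
    #    proceed concurrently and overlap with the math below.
    for shard in sharded_grads:
        bufs = [torch.empty_like(shard) for _ in range(world_size)]
        handles.append(dist.all_gather(bufs, shard, group=group, async_op=True))
        gathered.append(bufs)

    updates = []
    # 2) Wait for each gather only when its result is needed, reassemble the
    #    full matrix, and run the Newton-Schulz orthogonalization on it.
    for handle, bufs, shape in zip(handles, gathered, full_shapes):
        handle.wait()
        full_grad = torch.cat(bufs).view(shape)
        updates.append(newton_schulz_orthogonalize(full_grad))
    return updates
```

Whether this pattern (or something closer to the paper’s ZeRO-1-style distributed Muon) is actually communication-efficient enough is exactly what I’d like the group to dig into.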

This is a great opportunity for anyone interested in the intersection of optimizer design and large-scale distributed systems. If you’d like to join in on reverse-engineering this and building a cool open-source tool, please reply here or reach out!

Looking forward to collaborating. :handshake:


Ehmm, that is some effort you’ve put into ideating this.
Tasks 1 and 2 seem like something I could easily fit in and work on.
Task 3 will require some time and a fair amount of research to get going.
Really loved the post; just wondering if you have some sort of screening for people who join this group of yours.


Thank you for your interest, and welcome to the Hugging Face forum!

I’ve set up a Discord server for the study group; please see my reply below in this thread for the details.


Great news, everyone! I’ve set up a Discord server for our study group to coordinate and dive into the paper together.

Discord Invite Link: BirdOfParadise

Study group channel: #muon-is-scalable-study-group

My plan is to keep the initial phase (studying the paper and exploring the PyTorch APIs) open to everyone. The main goal is to create a supportive space where we can all learn from each other. After we’ve built a shared understanding, we can form a core team for the proof-of-concept implementation.

Let’s make sure our discussions are respectful, and that we give credit for shared ideas as we go. Looking forward to collaborating with you all! :handshake:


Edit: updated the link above.


I’m not sure if I’ll be able to contribute to the final implementation, but may I participate somehow?


of course! :slight_smile:
