Study Group: Implementing a Scalable, FSDP-Compatible Muon Optimizer

Hi everyone,

I’ve recently been diving deep into the OLMo codebase and submitted a PR to add a DDP-compatible version of the Muon optimizer (with an AdamW backup). The FSDP-compatible version I’m working on is logically correct but introduces significant communication overhead. That work led me to the fascinating Moonshot AI paper, “Muon is Scalable for LLM Training,” where they successfully trained models with over 1T parameters using this optimizer.

Here’s the core challenge: Muon’s update is not element-wise. It orthogonalizes each full weight matrix (via Newton-Schulz iterations), so sharded data-parallel strategies like FSDP have to reassemble the full gradient/momentum matrices before the update, and that reassembly is where the communication overhead comes from. The Moonshot paper shows that a communication-efficient, scalable implementation exists, likely relying on techniques like coalesced, non-blocking communication collectives.
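
To make the bottleneck concrete, here is a minimal, naive sketch (not the PR code). The Newton-Schulz coefficients follow the public Muon implementations; `sharded_muon_step_sketch`, its arguments, and the even row-sharding are hypothetical assumptions on my part. The blocking all-gather right before the matrix math is exactly the communication overhead in question.

```python
import torch
import torch.distributed as dist


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G.
    # Coefficients follow public Muon implementations; the whole matrix is
    # needed at once -- this is the non-element-wise part.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)


def sharded_muon_step_sketch(param_shard, grad_shard, momentum_shard,
                             lr=0.02, beta=0.95, group=None):
    # Hypothetical per-parameter step. Assumes each rank holds a contiguous,
    # equal-sized row-shard of one 2D parameter, its gradient, and its momentum.
    # (Muon's usual shape-based scale factor is omitted to keep this short.)
    momentum_shard.mul_(beta).add_(grad_shard)

    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    gathered = [torch.empty_like(momentum_shard) for _ in range(world_size)]
    dist.all_gather(gathered, momentum_shard, group=group)  # blocking: the overhead lives here
    full_momentum = torch.cat(gathered, dim=0)               # reassemble the full matrix

    update = newton_schulz_orthogonalize(full_momentum)
    rows = momentum_shard.size(0)
    param_shard.add_(update[rank * rows:(rank + 1) * rows], alpha=-lr)
```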

I’m looking to form a small study group to:

  1. Deep-dive into the Moonshot paper to understand their distributed implementation.

  2. Explore how to implement these techniques in PyTorch (see the rough sketch after this list for the kind of non-blocking collective pattern involved).

  3. Collaborate on a proof-of-concept, FSDP-compatible Muon optimizer that is both logically correct and communication-efficient.
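
On point 2, here is the kind of non-blocking, overlapped collective pattern I have in mind, as a rough sketch rather than anything taken from the paper. The function name, the even padding-free flat sharding, and the `orthogonalize` callable are all my assumptions; a real implementation would also need to coalesce many small matrices into fewer, larger gathers.

```python
import torch
import torch.distributed as dist


def overlapped_gather_orthogonalize(shards, full_shapes, orthogonalize, group=None):
    # Hypothetical comm/compute overlap sketch. Assumes each 2D parameter is
    # evenly sharded across ranks with no padding, so concatenating the
    # gathered chunks and reshaping recovers the full matrix.
    world_size = dist.get_world_size(group)

    # 1) Launch every all-gather up front as a non-blocking collective.
    pending = []
    for shard, shape in zip(shards, full_shapes):
        flat = shard.contiguous().view(-1)
        out = torch.empty(world_size * flat.numel(), dtype=flat.dtype, device=flat.device)
        work = dist.all_gather_into_tensor(out, flat, group=group, async_op=True)
        pending.append((out, shape, work))

    # 2) Consume results in launch order: while one matrix is being
    #    orthogonalized, the gathers for the later ones keep progressing.
    updates = []
    for out, shape, work in pending:
        work.wait()                        # waits only on this parameter's gather
        updates.append(orthogonalize(out.view(shape)))
    return updates
```

Bucketing small matrices into fewer gathers and deciding which rank runs which orthogonalization are exactly the details I’d like the group to work out together.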

This is a great opportunity for anyone interested in the intersection of optimizer design and large-scale distributed systems. If you’d like to join in on reverse-engineering this and building a cool open-source tool, please reply here or reach out!

Looking forward to collaborating. :handshake:
