MoE LLMs (like Mixtral) have set a new bar for efficient scaling. But all open MoEs route at the token level, with expert specialization emerging implicitly.
Recent research (TaskMoE, DomainMoE, THOR-MoE, GLaM) explores explicit routing by domain and even subdomain; a rough sketch of what that gating could look like follows this list. This enables:
Targeted upgrades (swap in a better “math” or “literature” expert without retraining the whole model)
More interpretable model internals
Modularity that aligns with how orchestrators (AutoGen, CrewAI, MCP) are evolving
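To make the first point concrete, here is a minimal PyTorch sketch of what two-level, domain-aware routing could look like: a coarse gate picks one domain's expert group per sequence, then ordinary top-k token routing runs inside that group only. Everything here (class name, domain list, sizes, hard sequence-level routing) is a hypothetical illustration, not Mixtral's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareMoE(nn.Module):
    """Two-level routing: a coarse domain gate picks an expert group per sequence,
    then standard top-k token routing runs inside that group only."""

    def __init__(self, d_model=512, d_ff=1024,
                 domains=("english", "math", "code"), experts_per_domain=4, top_k=2):
        super().__init__()
        self.domains = list(domains)
        self.top_k = top_k
        # One group of FFN experts per domain (the swappable unit).
        self.expert_groups = nn.ModuleDict({
            d: nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(experts_per_domain)
            ]) for d in self.domains
        })
        self.domain_gate = nn.Linear(d_model, len(self.domains))   # sequence-level gate
        self.token_gates = nn.ModuleDict({                         # token-level gates
            d: nn.Linear(d_model, experts_per_domain) for d in self.domains
        })

    def forward(self, x):                                # x: (batch, seq, d_model)
        # Hard domain choice from mean-pooled hidden states.
        domain_idx = self.domain_gate(x.mean(dim=1)).argmax(dim=-1)   # (batch,)
        out = torch.zeros_like(x)
        for b, idx in enumerate(domain_idx.tolist()):
            domain = self.domains[idx]
            experts = self.expert_groups[domain]
            weights, chosen = self.token_gates[domain](x[b]).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            for k in range(self.top_k):                  # weighted sum of top-k experts
                for e in range(len(experts)):
                    mask = chosen[:, k] == e
                    if mask.any():
                        out[b, mask] += weights[mask, k].unsqueeze(-1) * experts[e](x[b, mask])
        return out
```

Hard per-sequence routing is just one choice; per-token domain gating, soft mixtures of groups, or the meta-experts mentioned below would change the picture, but the swappable unit (one expert group per domain) stays the same.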
What might this look like for Mistral?
Expert groups per domain (English, math, code, etc.)
Hierarchies within domains (e.g., arithmetic → algebra → calculus), potentially with meta-experts that arbitrate or combine outputs
A possible “expert registry” for community or enterprise swapping/upgrading
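One rough guess at what such a registry could look like: a JSON manifest mapping each domain to a versioned, hashed weight file, plus a helper that swaps a single expert group in place. The file layout, field names, and the parameter-name prefix (taken from the hypothetical DomainAwareMoE sketch above) are all assumptions.

```python
import json
from pathlib import Path
import torch

# experts/registry.json (hypothetical format):
# {
#   "math": {"version": "1.2.0", "weights": "experts/math-1.2.0.pt", "sha256": "..."},
#   "code": {"version": "0.9.1", "weights": "experts/code-0.9.1.pt", "sha256": "..."}
# }

def load_registry(path="experts/registry.json"):
    return json.loads(Path(path).read_text())

def swap_expert_group(model, domain, registry):
    """Swap one domain's expert-group weights in-place; everything else is untouched.
    Assumes the checkpoint uses the same parameter names as the model
    (e.g. 'expert_groups.math.0.0.weight'), so shapes and interfaces must match."""
    entry = registry[domain]
    new_state = torch.load(entry["weights"], map_location="cpu")
    prefix = f"expert_groups.{domain}."
    current = model.state_dict()
    for name, tensor in new_state.items():
        if not name.startswith(prefix):
            raise KeyError(f"unexpected parameter {name!r} for domain {domain!r}")
        if current[name].shape != tensor.shape:
            raise ValueError(f"shape mismatch for {name!r}: expert interface changed")
        current[name] = tensor
    model.load_state_dict(current)
    return entry["version"]
```

A registry can't hide the hard constraint, though: a swapped group only drops in cleanly if hidden sizes and gating interfaces are treated as frozen contracts, which is exactly the first open question below.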
This isn’t trivial. Some questions:
How should gating and training be handled so that adding or swapping an expert doesn't cause catastrophic forgetting or an interface mismatch? (One conservative baseline, freezing everything but the target expert group, is sketched after this list.)
What’s the best way to benchmark a swapped module, both on its target domain and for regressions everywhere else? (A minimal A/B harness follows the list.)
Are there security or trust issues with openly distributed expert modules, and how do other plugin/package ecosystems handle them? (A hash-verification sketch follows the list.)
Who’s working on this already? Any public code, experiments, or ideas?
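On the first question, the most conservative baseline I can think of (sketched below, reusing the hypothetical parameter names from the earlier sketch) is to freeze every shared weight and train only the target expert group, optionally together with its token gate, so nothing outside that group can be forgotten and the routing interface never moves.

```python
import torch

def prepare_for_expert_finetune(model, domain, train_gate=False, lr=1e-4):
    """Freeze all shared weights; train only one domain's expert group
    (and optionally its token gate). Parameter names follow the sketch above."""
    trainable = []
    for name, param in model.named_parameters():
        keep = name.startswith(f"expert_groups.{domain}.") or (
            train_gate and name.startswith(f"token_gates.{domain}.")
        )
        param.requires_grad = keep
        if keep:
            trainable.append(param)
    return torch.optim.AdamW(trainable, lr=lr)

# opt = prepare_for_expert_finetune(model, "math", train_gate=True)
# ...then fine-tune on math-domain data only. Shared weights never move, so nothing
# outside the math group can be forgotten; the open question is whether a frozen
# domain gate still routes well to the retrained experts.
```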
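On benchmarking, a bare-bones A/B harness seems like the obvious starting point: run the same per-domain eval sets before and after the swap and look at the deltas. `evaluate()` and the eval-set format are placeholders.

```python
def compare_swapped_expert(evaluate, base_model, swapped_model, eval_sets):
    """evaluate(model, dataset) -> float score, higher is better (placeholder).
    eval_sets: {"math": math_eval, "code": code_eval, ...} per-domain benchmarks."""
    report = {}
    for domain, dataset in eval_sets.items():
        before = evaluate(base_model, dataset)
        after = evaluate(swapped_model, dataset)
        report[domain] = {"before": before, "after": after, "delta": after - before}
    return report
```

The upgraded domain should improve; every other domain acts as a regression suite and should stay within noise, otherwise the swap leaked side effects through the shared layers.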
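On trust, package ecosystems mostly rely on content hashes and signatures (pip's --require-hashes, npm lockfile integrity fields, Sigstore for signing). A minimal equivalent for expert modules would be checking the weight file's SHA-256 against the registry entry before loading; using safetensors or torch.load(weights_only=True) also avoids pickle-based code execution from untrusted checkpoints.

```python
import hashlib
from pathlib import Path

def verify_expert_weights(entry):
    """Refuse to load expert weights whose SHA-256 doesn't match the registry entry."""
    digest = hashlib.sha256(Path(entry["weights"]).read_bytes()).hexdigest()
    if digest != entry["sha256"]:
        raise RuntimeError(f"hash mismatch for {entry['weights']}; refusing to load")
```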
Links:
TaskMoE:
DomainMoE:
THOR-MoE:
AutoGen: https://github.com/microsoft/autogen
CrewAI: https://github.com/joaomdmoura/crewAI
ModelContextProtocol: https://github.com/modelcontextprotocol/servers
Would love thoughts, critique, and collaboration. Is this plausible as the next step for Mixtral (or other open MoEs)? What would it take to make this real?
TL;DR
Is it time for modular, upgradeable, domain-aware MoE in open models like Mistral? What’s missing—and who’s already working on it?