Hugging Face offers a variety of Mixture of Experts (MoE) models, such as Switch Transformer and Mixtral-MoE, that are ready to deploy. However, a notable limitation is that many of these models lack support for expert parallelism, meaning the distribution of experts across multiple devices isn't handled automatically. For example, if a layer contains 8 experts and you have 4 GPUs, ideally each GPU would host two experts so that all devices are used efficiently. Is there a straightforward way to get this kind of parallel placement without a lot of custom engineering?
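
To make the desired behaviour concrete, here is a minimal hand-rolled sketch of what I mean. This is not the actual Switch Transformer or Mixtral implementation, just a toy top-1 MoE layer where each expert is manually placed on `cuda:{i % num_gpus}` and tokens are shipped to the expert's device and back. The class name, dimensions, and placement scheme are all assumptions for illustration only:

```python
# Illustrative sketch only: manual expert parallelism for one MoE layer,
# assuming 8 experts spread over 4 GPUs (2 experts per device).
import torch
import torch.nn as nn

class ShardedMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, num_gpus=4):
        super().__init__()
        # Router stays on the "main" device where the activations live.
        self.router = nn.Linear(d_model, num_experts).to("cuda:0")
        self.experts = nn.ModuleList()
        self.expert_devices = []
        for i in range(num_experts):
            device = torch.device(f"cuda:{i % num_gpus}")
            expert = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            ).to(device)
            self.experts.append(expert)
            self.expert_devices.append(device)

    def forward(self, x):
        # x: (num_tokens, d_model) on cuda:0
        logits = self.router(x)
        top1 = logits.argmax(dim=-1)  # top-1 routing for simplicity
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # Move the selected tokens to the expert's GPU, run the
                # expert, then bring the outputs back to the main device.
                tokens = x[mask].to(self.expert_devices[i])
                out[mask] = expert(tokens).to(x.device)
        return out

layer = ShardedMoELayer()
x = torch.randn(16, 512, device="cuda:0")
y = layer(x)  # routing on cuda:0, experts running on their own GPUs
```

Doing this by hand for every MoE block of a pretrained checkpoint (and keeping it correct under `from_pretrained`, `device_map`, gradient flow, etc.) is exactly the effort I'm hoping to avoid, so I'd love to know if there is built-in or recommended tooling for it.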