How can I run Hugging Face MoE models in an expert-parallel manner?

Hugging Face offers a variety of Mixture of Experts (MoE) models, such as Switch Transformer and Mixtral-MoE, that are ready for deployment. However, a notable limitation is that many of these models lack support for expert parallelism, i.e., the distribution of experts across multiple devices is not handled automatically. For example, if a layer contains 8 experts and you have 4 GPUs available, ideally each GPU would host two of those experts so that all devices are used efficiently. Achieving this kind of distribution without significant manual effort is the challenge. Is there a straightforward way to implement this form of expert parallelism with minimal complexity?
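
To make the intent concrete, here is a minimal, hypothetical PyTorch sketch of what "each GPU manages two experts" would look like for a single MoE layer. `ToyExpert` and `ExpertParallelMoE` are made-up names for illustration, not classes from `transformers`; the routing is a simplified top-1 gate rather than the exact gating any particular HF model uses.

```python
# Hypothetical sketch: place the experts of one MoE layer on GPUs round-robin
# and route each token to the device that hosts its selected expert.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyExpert(nn.Module):
    """Stand-in for one expert MLP (e.g., one of Mixtral's 8 experts)."""
    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.up = nn.Linear(hidden_size, ffn_size)
        self.down = nn.Linear(ffn_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class ExpertParallelMoE(nn.Module):
    """Top-1 MoE layer whose experts live on different devices."""
    def __init__(self, hidden_size=64, ffn_size=256, num_experts=8):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        n_gpu = torch.cuda.device_count()
        # Round-robin placement: with 8 experts and 4 GPUs, each GPU gets 2.
        self.devices = [
            torch.device(f"cuda:{i % n_gpu}") if n_gpu else torch.device("cpu")
            for i in range(num_experts)
        ]
        self.experts = nn.ModuleList(
            [ToyExpert(hidden_size, ffn_size).to(dev) for dev in self.devices]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        tokens = hidden.view(-1, hidden.size(-1))         # (num_tokens, hidden)
        expert_ids = self.router(tokens).argmax(dim=-1)   # top-1 routing
        out = torch.zeros_like(tokens)
        for idx, (expert, dev) in enumerate(zip(self.experts, self.devices)):
            mask = expert_ids == idx
            if mask.any():
                # Ship the selected tokens to the expert's device and back.
                out[mask] = expert(tokens[mask].to(dev)).to(tokens.device)
        return out.view_as(hidden)


if __name__ == "__main__":
    layer = ExpertParallelMoE()
    x = torch.randn(2, 16, 64)  # (batch, seq, hidden)
    print(layer(x).shape)       # torch.Size([2, 16, 64])
```

What I'm looking for is a way to get this kind of expert placement out of the box for the existing HF MoE implementations, rather than hand-rolling device assignment and token routing for each model.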