I am pre-training a language model with run_mlm.py, and I want to integrate a mixture-of-experts (MoE) layer into the original BERT-base model. Specifically, I reuse the MoELayer implemented by DeepSpeed and add it to BertForMaskedLM. From the DeepSpeed documentation, I see that training with DeepSpeed requires calling a function like this:
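(For reference, this is the standard initialization call from the DeepSpeed getting-started guide; the exact arguments depend on your setup, and `cmd_args` here is just the parsed arguments holding the DeepSpeed config.)

```python
import deepspeed

# Wraps the model in a DeepSpeed engine and returns the engine/optimizer
# that DeepSpeed expects you to use in the training loop.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=cmd_args,                        # parsed args containing the DeepSpeed config
    model=model,                          # the BertForMaskedLM with the MoE layers
    model_parameters=model.parameters(),
)
```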
However, I just want to reuse the MoE layer implemented by DeepSpeed while keeping the Hugging Face training behavior. Currently, I skip this call and pass the language model (with the DeepSpeed MoE layer inside) directly into Trainer(). Although it runs successfully, my question is: does this approach hide any potential dangers?
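For what it's worth, my current setup is just the vanilla Trainer loop, roughly like this (`train_dataset` and `data_collator` are the usual objects built by run_mlm.py, and the argument values are placeholders):

```python
from transformers import Trainer, TrainingArguments

# Plain Hugging Face training loop; `model` is the BertForMaskedLM
# whose FFN blocks have been replaced with DeepSpeed MoE layers.
training_args = TrainingArguments(
    output_dir="bert-moe-mlm",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```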
I can’t answer your question, but I’m a bit confused. From what I have read about the MoE layer, the point of it is to facilitate the use (Mixture) of many different models (Experts) concurrently, but you say you are using the MoE layer on top of a single BertForMaskedLM model.
What are you hoping the MoE layer will do for you? Does it have some other advantages?
Hi rgwatwormhill, sorry for not explaining it clearly. Actually, I combine the MoE with the FFN module, following the design of the Switch Transformer. That paper claims this design improves training efficiency, though I haven't observed that in my experiments yet…
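To make it concrete, the modification is roughly the following sketch; the class and module names are just placeholders, and the wiring of this block into each BertLayer is omitted:

```python
import torch.nn as nn
from deepspeed.moe.layer import MoE


class FFNExpert(nn.Module):
    """One expert = the usual BERT position-wise feed-forward network."""
    def __init__(self, config):
        super().__init__()
        self.dense_in = nn.Linear(config.hidden_size, config.intermediate_size)
        self.act = nn.GELU()
        self.dense_out = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, hidden_states):
        return self.dense_out(self.act(self.dense_in(hidden_states)))


class MoEFeedForward(nn.Module):
    """Switch-Transformer-style replacement for the FFN block in a BertLayer."""
    def __init__(self, config, num_experts=8):
        super().__init__()
        self.moe = MoE(
            hidden_size=config.hidden_size,
            expert=FFNExpert(config),
            num_experts=num_experts,
            k=1,  # top-1 routing, as in the Switch Transformer
        )
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        # DeepSpeed's MoE returns (output, aux_loss, expert_counts); the auxiliary
        # load-balancing loss is not picked up automatically, so it has to be
        # added to the MLM loss somewhere by hand.
        moe_output, aux_loss, _ = self.moe(hidden_states)
        hidden_states = self.layer_norm(hidden_states + self.dropout(moe_output))
        return hidden_states, aux_loss
```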
I also plan to extend the Hugging Face Transformers models with DeepSpeed MoE. Do you have any successful experience in doing so? Can the models run and train successfully by simply adding the MoE layer to the FFN layer?