I am pre-training a language model with run_mlm.py and want to integrate a mixture-of-experts (MoE) layer into the original BERT-base model. Specifically, I reuse the MoE layer implemented by DeepSpeed and add it to BertForMaskedLM, roughly as sketched below.
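A minimal sketch of what I mean (the ExpertFFN module, the choice of replacing the feed-forward of the last encoder layer, and the num_experts/k values are only illustrative; only deepspeed.moe.layer.MoE itself comes from DeepSpeed, and depending on the DeepSpeed version the expert-parallel process groups may also need to be set up before the layer is actually used):

import types
import torch.nn as nn
from transformers import BertForMaskedLM
from deepspeed.moe.layer import MoE

class ExpertFFN(nn.Module):
    """One expert: a standard BERT-style position-wise feed-forward block."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.dense_in = nn.Linear(hidden_size, intermediate_size)
        self.act = nn.GELU()
        self.dense_out = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        return self.dense_out(self.act(self.dense_in(x)))

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
cfg = model.config

# Attach a DeepSpeed MoE layer to the last encoder layer (illustrative choice).
layer = model.bert.encoder.layer[-1]
layer.moe = MoE(
    hidden_size=cfg.hidden_size,
    expert=ExpertFFN(cfg.hidden_size, cfg.intermediate_size),
    num_experts=4,  # illustrative value
    k=1,
)
layer.moe_layernorm = nn.LayerNorm(cfg.hidden_size, eps=cfg.layer_norm_eps)

def moe_feed_forward_chunk(self, attention_output):
    # DeepSpeed's MoE.forward returns (hidden_states, aux_loss, expert_counts);
    # the auxiliary load-balancing loss is dropped here for simplicity.
    moe_output, _l_aux, _exp_counts = self.moe(attention_output)
    return self.moe_layernorm(attention_output + moe_output)

# BertLayer routes its feed-forward through feed_forward_chunk, so overriding
# that method on this one instance swaps in the MoE path without touching the
# rest of the model.
layer.feed_forward_chunk = types.MethodType(moe_feed_forward_chunk, layer)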
From the DeepSpeed documentation, I see that training with DeepSpeed requires calling functions like this:
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args, model=net, model_parameters=net.parameters())
However, I only want to reuse the MoE layer implemented by DeepSpeed while keeping Hugging Face's training behavior. Currently, I skip this call and pass the language model (with the DeepSpeed MoE layer) directly into Trainer(). Although it runs successfully, my question is: does this introduce any potential problems?
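For reference, the way I currently hand the model to the Trainer looks roughly like this (the tiny in-memory dataset and the training arguments are only placeholders for the real run_mlm.py pipeline; model is the MoE-augmented BertForMaskedLM from the sketch above):

from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Placeholder corpus standing in for the tokenized dataset produced by run_mlm.py.
raw = Dataset.from_dict({"text": [
    "DeepSpeed mixture-of-experts inside BERT.",
    "Masked language modeling example sentence.",
]})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=32),
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="./mlm-moe",  # placeholder path
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,  # BertForMaskedLM with the DeepSpeed MoE layer, no deepspeed.initialize()
    args=training_args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()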