As I understand it, src/transformers/trainer.py supports DeepSpeed but not Megatron-LM. Is that right?
When I tried to make it support Megatron-LM, I ran into a few problems:
- When and where should the Megatron-LM checkpoint be loaded? (A rough sketch of what I mean follows the options.)
  A. In the function _inner_training_loop in trainer.py
  B. In prepare_model in accelerate/utils/megatron_lm.py
  C. Some other, better way
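To make option A concrete, here is a minimal sketch of what I have in mind, under the assumption that the checkpoint can only be loaded after accelerate has prepared (i.e. tensor/pipeline-parallel sharded) the model, so each rank loads just its own shard. load_megatron_checkpoint is a hypothetical placeholder for Megatron-LM's own checkpoint loading, whose exact import path differs between Megatron-LM versions:

```python
def maybe_load_megatron_checkpoint(trainer, prepared_model):
    # Intended to run inside _inner_training_loop, right after accelerator.prepare()
    # has returned the parallelized model; before that point the parallel groups and
    # the sharded parameters that the checkpoint maps onto do not exist yet.
    ckpt_dir = trainer.args.resume_from_checkpoint
    if ckpt_dir is None:
        return
    load_megatron_checkpoint(prepared_model, ckpt_dir)  # hypothetical wrapper around Megatron-LM's loader
```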
- In Megatron-LM, only the loss on the last pipeline-parallel rank (last_pp_rank) is valid, while the Trainer takes the mean of the losses across all ranks. When and where should Megatron's loss be handled? (See the sketch after the options.)
  A. In the function _nested_gather in trainer.py
  B. Inside Megatron-LM itself (the NVIDIA/Megatron-LM repo)
  C. Some other, better way
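One possible fix (closest to option A or C) would be to broadcast the loss from the last pipeline stage to every rank before the Trainer gathers and averages it, so the mean is taken over identical valid values. This is only a sketch with plain torch.distributed; in practice last_pp_rank and the process group would come from Megatron-LM's parallel-state helpers, and with data parallelism the broadcast should happen within each pipeline-parallel group rather than over the whole world:

```python
import torch
import torch.distributed as dist

def broadcast_loss_from_last_stage(loss: torch.Tensor, last_pp_rank: int, group=None) -> torch.Tensor:
    # Make the valid loss computed on the last pipeline stage visible on every rank,
    # so that the Trainer's subsequent mean over ranks averages identical values
    # instead of mixing in the dummy losses from earlier pipeline stages.
    loss = loss.detach().clone()
    dist.broadcast(loss, src=last_pp_rank, group=group)
    return loss
```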
- The Trainer calls accelerate's prepare separately for the model and for the dataloader, which initializes Megatron-LM repeatedly. How can this be avoided? (A sketch of one workaround follows the options.)
  A. Modify the initialization in Megatron-LM/megatron/global_vars.py
  B. Modify the initialize function in accelerate/utils/megatron_lm.py
  C. Some other, better way
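The simplest workaround I can think of (along the lines of option B or C) is to make the initialization idempotent with a module-level guard, so the second prepare call becomes a no-op. This is just a sketch; initialize_fn stands in for whatever the real entry point is (e.g. the initialize helper in accelerate/utils/megatron_lm.py):

```python
_MEGATRON_INITIALIZED = False

def initialize_megatron_once(initialize_fn, *args, **kwargs):
    # Run the real Megatron-LM initialization at most once per process, so that the
    # separate accelerator.prepare() calls for the model and the dataloader do not
    # re-initialize the global Megatron state.
    global _MEGATRON_INITIALIZED
    if _MEGATRON_INITIALIZED:
        return
    initialize_fn(*args, **kwargs)
    _MEGATRON_INITIALIZED = True
```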
- Megatron-LM requires every batch to have the same sequence length, but the given dataset yields batches with varying sequence lengths. When and where should the batches be padded to meet that requirement? (Sketch below.)
  A. In Accelerate
  B. In Megatron-LM
  C. Some other, better way
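As an illustration of option C, one could also do the padding on the dataloader side with a data collator that pads (or truncates) every batch to one fixed sequence length, so Megatron-LM always sees batches of identical shape. This sketch assumes the usual input_ids / attention_mask / labels keys and a known pad token id; it would need to be adapted to the actual dataset:

```python
import torch

def make_fixed_length_collator(max_seq_length: int, pad_token_id: int):
    def collate(features):
        batch = {}
        for key, pad_value in (("input_ids", pad_token_id),
                               ("attention_mask", 0),
                               ("labels", -100)):
            rows = []
            for f in features:
                ids = torch.as_tensor(f[key])[:max_seq_length]
                # Pad every example up to the fixed length expected by Megatron-LM.
                pad = torch.full((max_seq_length - ids.shape[0],), pad_value, dtype=ids.dtype)
                rows.append(torch.cat([ids, pad]))
            batch[key] = torch.stack(rows)
        return batch
    return collate
```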