As I understand it, src/transformers/trainer.py supports DeepSpeed but not Megatron-LM. Is that right?
When I tried to make it support Megatron-LM, I ran into a few problems:
- When and where should the Megatron-LM checkpoint be loaded? (A rough sketch of what I mean follows the options.)
  A. In the function _inner_training_loop in trainer.py
  B. In prepare_model in accelerate/utils/megatron_lm.py
  C. Some other, better way
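To make option A concrete, here is a minimal sketch of what I have in mind, under the assumption that the checkpoint can only be loaded after accelerate has prepared (i.e. tensor/pipeline-parallel sharded) the model, so each rank loads just its own shard. load_megatron_checkpoint is a hypothetical placeholder for Megatron-LM's own checkpoint loading, whose exact import path differs between Megatron-LM versions:

```python
def maybe_load_megatron_checkpoint(trainer, prepared_model):
    # Intended to run inside _inner_training_loop, right after accelerator.prepare()
    # has returned the parallelized model; before that point the parallel groups and
    # the sharded parameters that the checkpoint maps onto do not exist yet.
    ckpt_dir = trainer.args.resume_from_checkpoint
    if ckpt_dir is None:
        return
    load_megatron_checkpoint(prepared_model, ckpt_dir)  # hypothetical wrapper around Megatron-LM's loader
```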
- In Megatron-LM, only the loss on the last pipeline-parallel rank (last_pp_rank) is valid, while the Trainer takes the mean of the losses across all ranks. When and where should Megatron's loss be handled? (See the sketch after the options.)
  A. In the function _nested_gather in trainer.py
  B. Inside Megatron-LM itself (the NVIDIA/Megatron-LM repo)
  C. Some other, better way
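One possible fix (closest to option A or C) would be to broadcast the loss from the last pipeline stage to every rank before the Trainer gathers and averages it, so the mean is taken over identical valid values. This is only a sketch with plain torch.distributed; in practice last_pp_rank and the process group would come from Megatron-LM's parallel-state helpers, and with data parallelism the broadcast should happen within each pipeline-parallel group rather than over the whole world:

```python
import torch
import torch.distributed as dist

def broadcast_loss_from_last_stage(loss: torch.Tensor, last_pp_rank: int, group=None) -> torch.Tensor:
    # Make the valid loss computed on the last pipeline stage visible on every rank,
    # so that the Trainer's subsequent mean over ranks averages identical values
    # instead of mixing in the dummy losses from earlier pipeline stages.
    loss = loss.detach().clone()
    dist.broadcast(loss, src=last_pp_rank, group=group)
    return loss
```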
- The Trainer calls accelerate's prepare separately for the model and for the dataloader, which initializes Megatron-LM repeatedly. How can this be avoided? (A sketch of one workaround follows the options.)
  A. Modify the initialization in Megatron-LM/megatron/global_vars.py
  B. Modify the initialize function in accelerate/utils/megatron_lm.py
  C. Some other, better way
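The simplest workaround I can think of (along the lines of option B or C) is to make the initialization idempotent with a module-level guard, so the second prepare call becomes a no-op. This is just a sketch; initialize_fn stands in for whatever the real entry point is (e.g. the initialize helper in accelerate/utils/megatron_lm.py):

```python
_MEGATRON_INITIALIZED = False

def initialize_megatron_once(initialize_fn, *args, **kwargs):
    # Run the real Megatron-LM initialization at most once per process, so that the
    # separate accelerator.prepare() calls for the model and the dataloader do not
    # re-initialize the global Megatron state.
    global _MEGATRON_INITIALIZED
    if _MEGATRON_INITIALIZED:
        return
    initialize_fn(*args, **kwargs)
    _MEGATRON_INITIALIZED = True
```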
- Megatron-LM requires every batch to have the same sequence length, but the given dataset yields batches with varying sequence lengths. When and where should the batches be padded to meet that requirement? (Sketch below.)
  A. In Accelerate
  B. In Megatron-LM
  C. Some other, better way
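As an illustration of option C, one could also do the padding on the dataloader side with a data collator that pads (or truncates) every batch to one fixed sequence length, so Megatron-LM always sees batches of identical shape. This sketch assumes the usual input_ids / attention_mask / labels keys and a known pad token id; it would need to be adapted to the actual dataset:

```python
import torch

def make_fixed_length_collator(max_seq_length: int, pad_token_id: int):
    def collate(features):
        batch = {}
        for key, pad_value in (("input_ids", pad_token_id),
                               ("attention_mask", 0),
                               ("labels", -100)):
            rows = []
            for f in features:
                ids = torch.as_tensor(f[key])[:max_seq_length]
                # Pad every example up to the fixed length expected by Megatron-LM.
                pad = torch.full((max_seq_length - ids.shape[0],), pad_value, dtype=ids.dtype)
                rows.append(torch.cat([ids, pad]))
            batch[key] = torch.stack(rows)
        return batch
    return collate
```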