I'm training a large LM on two nodes, each with 8 GPUs, using PyTorch's FSDP for distributed training. I launched the training with:
```bash
# --node_rank is 0 on the master node and 1 on the worker node
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=xxx \
  --master_port=9901 \
  run_clm.py <other args>
```
The training job runs smoothly. I set the Trainer to save a checkpoint every 100 steps so that I can resume from the latest checkpoint if anything goes wrong. The checkpoints are saved successfully on the master node. However, on the worker node only the `rng_state_[8-15].pth` files are saved, each just 24K, and when I tried to resume training from this checkpoint, it failed on the worker node.
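For reference, the relevant part of my Trainer setup is roughly the sketch below. The model name, toy dataset, and FSDP wrap class are illustrative placeholders rather than the exact arguments I pass to run_clm.py, and the script is launched on each node with the torchrun command above.

```python
from torch.utils.data import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


class ToyLMDataset(Dataset):
    """Placeholder standing in for the real tokenized corpus."""

    def __init__(self, tokenizer, n_examples=128, seq_len=64):
        ids = tokenizer("hello world " * seq_len,
                        truncation=True, max_length=seq_len)["input_ids"]
        self.examples = [{"input_ids": ids, "labels": ids}] * n_examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


# "gpt2" is a placeholder for the actual model trained in run_clm.py.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="./checkpoints",          # each node writes checkpoints here
    save_strategy="steps",
    save_steps=100,                      # checkpoint every 100 steps
    fsdp="full_shard auto_wrap",         # PyTorch FSDP via the Trainer
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "GPT2Block"},
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ToyLMDataset(tokenizer),
)

# Resuming later is trainer.train(resume_from_checkpoint=True), which picks
# up the latest checkpoint-* directory under output_dir.
trainer.train()
```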
My questions are:
- If I copy everything in the checkpoint directory from the master node to the worker node except the `rng_state_[0-7].pth` files, will resuming from that checkpoint give the correct state and behavior?
- Is there a way to save full checkpoints on worker nodes?
Here are my environment settings:

```
torch==2.0.0
transformers==4.29.2
cuda==11.8
```