I'm training a large LM on two nodes, each with 8 GPUs, using PyTorch's FSDP for distributed training. I launched the training with:
```bash
# --node_rank is 0 on the master node and 1 on the worker node
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=xxx \
  --master_port=9901 \
  run_clm.py <other args>
```
The training job runs smoothly. I set the Trainer to save a checkpoint every 100 steps so that I can resume from the latest checkpoint if anything goes wrong. The checkpoints are saved successfully on the master node. However, on the worker node only the `rng_state_[8-15].pth` files are saved, each just 24K, and when I tried to resume training from this checkpoint, it failed on the worker node.
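For reference, the relevant part of my Trainer setup is roughly the sketch below. The model name, toy dataset, and FSDP wrap class are illustrative placeholders rather than the exact arguments I pass to run_clm.py, and the script is launched on each node with the torchrun command above.

```python
from torch.utils.data import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


class ToyLMDataset(Dataset):
    """Placeholder standing in for the real tokenized corpus."""

    def __init__(self, tokenizer, n_examples=128, seq_len=64):
        ids = tokenizer("hello world " * seq_len,
                        truncation=True, max_length=seq_len)["input_ids"]
        self.examples = [{"input_ids": ids, "labels": ids}] * n_examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


# "gpt2" is a placeholder for the actual model trained in run_clm.py.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="./checkpoints",          # each node writes checkpoints here
    save_strategy="steps",
    save_steps=100,                      # checkpoint every 100 steps
    fsdp="full_shard auto_wrap",         # PyTorch FSDP via the Trainer
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "GPT2Block"},
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ToyLMDataset(tokenizer),
)

# Resuming later is trainer.train(resume_from_checkpoint=True), which picks
# up the latest checkpoint-* directory under output_dir.
trainer.train()
```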
My questions are:
- If I copy everything in the checkpoint directory from the master node to the worker node except the `rng_state_[0-7].pth` files, will resuming from that checkpoint give the correct state and behavior?
- Is there a way to save full checkpoints on worker nodes?
Here are my environment settings:

```
torch==2.0.0
transformers==4.29.2
cuda==11.8
```