Model checkpoints on a worker node in multi-node training


I’m training a large LM on two nodes, each with 8 GPUs, using PyTorch’s FSDP for distributed training. I launched the training with torchrun:

torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=xxx \
    --master_port=9901 \
    <other args>
# --node_rank=1 on the worker node

The training job runs smoothly. I configured the trainer to save checkpoints every 100 steps so that I can resume from the latest checkpoint if an error occurs. On the master node, the checkpoints are saved successfully. On the worker node, however, only rng_state_[8-15].pth files are saved, each just 24K. When I tried to resume training from this checkpoint, it failed on the worker node.
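For context, my understanding of what those rng_state_*.pth files are for (a minimal sketch of the assumed mechanism, using the stdlib random module rather than the trainer's actual serialization):

```python
import random

# Assumed mechanism behind the per-rank rng_state_*.pth files: each rank
# snapshots its RNG generator state at checkpoint time so that, on resume,
# the random stream (data shuffling, dropout masks, etc.) continues
# exactly where it left off.
random.seed(1234)
state = random.getstate()            # what gets serialized per rank

drawn_after_ckpt = [random.random() for _ in range(3)]

random.setstate(state)               # what a resume would restore
drawn_after_resume = [random.random() for _ in range(3)]

print(drawn_after_ckpt == drawn_after_resume)  # identical streams
```

This is why each rank writes its own small file: the state is per-process, not shared across the job.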

My questions are:

  1. If I copy everything in the checkpoint directory from the master node to the worker node, except the rng_state_[0-7].pth files, will resuming from that checkpoint give the correct state and behavior?
  2. Is there a way to save full checkpoints on the worker nodes as well?
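To make question 1 concrete, the copy I have in mind would look roughly like this (directory and file names here are hypothetical, just mirroring the checkpoint layout described above):

```shell
# Hypothetical layout: copy everything from the master node's checkpoint
# dir except its per-rank RNG files, keeping the worker's own
# rng_state_[8-15].pth files in place.
mkdir -p master_ckpt worker_ckpt
touch master_ckpt/model.safetensors master_ckpt/optimizer.pt
touch master_ckpt/rng_state_0.pth master_ckpt/rng_state_7.pth

for f in master_ckpt/*; do
  case "$(basename "$f")" in
    rng_state_[0-7].pth) ;;          # skip the master ranks' RNG files
    *) cp "$f" worker_ckpt/ ;;
  esac
done
ls worker_ckpt
```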

Here are my env settings: