Model checkpoints on a worker node in multi-node training

weiqis · June 7, 2023, 6:08pm

Hi,

I’m training a large LM on two nodes, each with 8 GPUs. I’m using PyTorch’s FSDP for distributed training, and I launch the job with torchrun:

# --node_rank is 0 on the master node and 1 on the worker node
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=xxx \
    --master_port=9901 \
    run_clm.py <other args>

The training job runs smoothly. I set the trainer to save checkpoints every 100 steps so that I can resume from the latest checkpoint if an error occurs. The model checkpoints are saved successfully on the master node. However, on the worker node only the rng_state_[8-15].pth files are saved, each just 24K in size. I tried to resume training from this checkpoint, but it failed on the worker node.

My questions are:

1. If I copy everything in the checkpoint directory from the master node to the worker node, except the rng_state_[0-7].pth files, will resuming from the checkpoint give the correct state and behavior?
2. Is there a way to save full checkpoints on worker nodes?
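
To make the first question concrete, here is a rough sketch of the copy I have in mind. It assumes the master's checkpoint directory is already reachable from the worker (for example via a shared or mounted path); the function name and the paths in the usage note are hypothetical. Whether resuming from the result is actually correct is exactly what I'm unsure about.

```python
import re
import shutil
from pathlib import Path

# Matches rng_state_0.pth ... rng_state_7.pth, i.e. the master node's ranks in my setup.
MASTER_RNG = re.compile(r"^rng_state_[0-7]\.pth$")


def copy_checkpoint_without_master_rng(src, dst):
    """Copy a checkpoint directory into dst, skipping the master's per-rank RNG files.

    Files already present in dst (such as the worker's own rng_state_[8-15].pth)
    are left in place unless a file with the same name exists in src.
    """
    def ignore(_directory, names):
        # Tell copytree which entries to skip in each directory it visits.
        return [name for name in names if MASTER_RNG.match(name)]

    # dirs_exist_ok requires Python 3.8+; it lets us copy into the existing
    # checkpoint directory on the worker instead of failing.
    shutil.copytree(src, dst, ignore=ignore, dirs_exist_ok=True)
    return sorted(p.name for p in Path(dst).iterdir())
```

I would call it on the worker with something like copy_checkpoint_without_master_rng("/mnt/master/output/checkpoint-100", "output/checkpoint-100"), where both paths are placeholders for my actual output directory.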

Here are my environment settings:

torch==2.0.0
transformers==4.29.2
cuda==11.8

Thanks!
