- Background: I use DeepSpeed (ZeRO-1) for multi-node training and save the optimizer states so that training can be resumed.
- I find that `accelerate.save_state` only saves the optimizer-state partitions that are local to each machine. With 32 GPUs (4 nodes), each node saves just its own 8 partitions; the first node, for example, saves `bf16_zero_pp_rank_(0-7)_mp_rank_00_optim_states.pt`. When calling `load_state` to resume training with DeepSpeed, the first node then reports:
  `The following zero checkpoints paths are missing: ['./output/test/step_5/pytorch_model/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt']`
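For reference, here is a minimal sketch of the save/resume flow that hits this (the toy model, data, and paths are placeholders, not my actual training script):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Launched via `accelerate launch` with a DeepSpeed ZeRO-1 config on 4 nodes x 8 GPUs.
accelerator = Accelerator()

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(TensorDataset(torch.randn(32, 8)), batch_size=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training steps ...

# Each node writes only the optimizer shards of its own 8 ranks, e.g.
# bf16_zero_pp_rank_0..7_mp_rank_00_optim_states.pt on the first node.
accelerator.save_state("./output/test/step_5")

# On resume, DeepSpeed expects the shards of all 32 ranks to be readable from
# every node, which is where the "missing zero checkpoints" error comes from.
accelerator.load_state("./output/test/step_5")
```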
It seems I would need to merge all the shard files from the different nodes into one place. Is there an easier way to synchronize them than the manual copy sketched below?
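For context, the manual synchronization I would like to avoid looks roughly like this (just a sketch; the shared mount path and the `LOCAL_RANK` handling are my assumptions, not an existing Accelerate/DeepSpeed API):

```python
import os
import shutil
from pathlib import Path

import torch.distributed as dist

LOCAL_CKPT = Path("./output/test/step_5/pytorch_model")          # node-local save dir
SHARED_CKPT = Path("/shared/output/test/step_5/pytorch_model")   # hypothetical shared mount


def publish_local_shards():
    """Copy this node's optimizer shards to storage that every node can read."""
    SHARED_CKPT.mkdir(parents=True, exist_ok=True)
    # One process per node is enough to copy that node's shard files.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        for shard in LOCAL_CKPT.glob("bf16_zero_pp_rank_*_optim_states.pt"):
            shutil.copy2(shard, SHARED_CKPT / shard.name)
    dist.barrier()  # wait until every node has published its shards before load_state
```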