To resume DeepSpeed ZeRO-1 multi-node training, I use accelerate.save_state to save the optimizer states. However, each node only saves its own partition of the states (e.g., 1/4 of the states on each of 4 machines). When loading the states to resume training, load_state reports that the checkpoints from the other nodes are missing:

The following zero checkpoints paths are missing: ['./output/test/step_5/pytorch_model/bf16_zero_pp_rank_{8-31}_mp_rank_00_optim_states.pt', ...]

and the subprocesses are killed.
Right now I have to manually synchronize the partitioned states from all nodes before resuming. Is there an easier solution?
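For reference, a minimal sketch of my save/load flow (the model, optimizer, dataloader, and DeepSpeedPlugin setup here are simplified placeholders; in the real run ZeRO-1 is configured through my accelerate/DeepSpeed config and launched with accelerate launch across the 4 nodes):

```python
# Minimal sketch, not the actual training script: placeholder model/optimizer/data,
# DeepSpeed ZeRO-1 assumed to be configured as in my accelerate config.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(16, 16)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters())    # placeholder optimizer
dataloader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training loop ...

# Each rank writes its own ZeRO-partitioned optimizer shard, so without shared
# storage every node only ends up with its local shards (e.g. ranks 0-7 on the
# first node when running 4 nodes x 8 GPUs, matching the error above).
accelerator.save_state("./output/test/step_5")

# On resume, load_state expects all shards to exist under the same path on
# every node, which is where the "zero checkpoints paths are missing" error
# and the killed subprocesses come from.
accelerator.load_state("./output/test/step_5")
```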