Resuming DeepSpeed training from saved states

To resume DeepSpeed ZeRO-1 multi-node training, I use accelerate.save_state to save the optimizer states. However, each node saves only its own partition of the states (e.g., 1/4 of the states with 4 machines). When I load the states to resume training, load_state reports that the other partitions are missing: "The following zero checkpoints paths are missing: ['./output/test/step_5/pytorch_model/bf16_zero_pp_rank_{8-31}', ...]", and the subprocesses are killed.

Resuming therefore seems to require manually synchronizing the partitioned states across the nodes. Is there an easier solution?
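
For reference, here is a minimal sketch of how the missing partitions can be confirmed on each node before calling load_state. It is only a diagnostic, not a fix. The helper names (local_ranks, missing_ranks) are hypothetical, and it assumes that each shard's file name contains "zero_pp_rank_<N>", as the paths in the error message suggest. The world size of 32 in the usage line is also an assumption, inferred from the rank range in the error.

```python
import re
from pathlib import Path

# Assumption: every shard file name contains "zero_pp_rank_<N>",
# matching the paths shown in the error message.
RANK_PATTERN = re.compile(r"zero_pp_rank_(\d+)")


def local_ranks(checkpoint_dir):
    """Return the set of partition ranks that have a file in checkpoint_dir."""
    ranks = set()
    for path in Path(checkpoint_dir).glob("*zero_pp_rank_*"):
        match = RANK_PATTERN.search(path.name)
        if match:
            ranks.add(int(match.group(1)))
    return ranks


def missing_ranks(checkpoint_dir, world_size):
    """Return the sorted ranks in [0, world_size) with no local shard."""
    return sorted(set(range(world_size)) - local_ranks(checkpoint_dir))


if __name__ == "__main__":
    # Hypothetical usage: world_size=32 is assumed, not confirmed.
    print(missing_ranks("./output/test/step_5/pytorch_model", world_size=32))
```

Running this on every node shows which shards would have to be copied, or whether the checkpoint directory is actually on storage shared by all nodes.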