Question about resuming training with DeepSpeed

  1. Background: I use DeepSpeed (ZeRO-1) for multi-node training and save the optimizer states so that training can be resumed.
  2. I find that accelerate.save_state only saves the partitioned optimizer state that is local to each machine. With 32 GPUs (4 nodes), each node saves only 8 partitions; the 1st node, for example, saves bf16_zero_pp_rank_(0-7). When I call load_state to resume training with DeepSpeed, the 1st node reports: The following zero checkpoints paths are missing: ['./output/test/step_5/pytorch_model/', './output/test/step_5/pytorch_model/', ...] (the same entry is repeated many times in the message).

This seems to require merging the states from all the nodes. Is there an easier way to synchronize them?
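
To see which partitions a given node cannot find, a small check like the one below may help. It is only a sketch: it assumes the partition files start with bf16_zero_pp_rank_<rank>_ (as in the names above) and ignores the rest of the file name, and missing_zero_ranks is a hypothetical helper, not part of accelerate or DeepSpeed.

```python
import re
from pathlib import Path

# Assumption: partition files start with "bf16_zero_pp_rank_<rank>_",
# matching the names reported in the post; the rest of the name is not checked.
RANK_PATTERN = re.compile(r"^bf16_zero_pp_rank_(\d+)_")


def missing_zero_ranks(checkpoint_dir, world_size):
    """Return the sorted ranks in range(world_size) that have no
    partition file inside checkpoint_dir on this machine."""
    found = set()
    for path in Path(checkpoint_dir).iterdir():
        match = RANK_PATTERN.match(path.name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(world_size)) - found)
```

Called with a world size of 32 on a node that only holds the files for ranks 0-7, this would return ranks 8 through 31, i.e. the partitions saved on the other three nodes.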

Hello @cyk1337, we have discussed this with the DeepSpeed team and they have said that the current codebase assumes the user has a shared filesystem.

  1. As a workaround, use a shared filesystem.
  2. Raised an issue on the DeepSpeed repo as per their recommendation: [BUG] Can’t load checkpoint without having shared filesystem in multi-node training when multi-node setup config remains same · Issue #2319 · microsoft/DeepSpeed. Please follow up there. Once that is resolved, you should be able to load the state without a shared filesystem.
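
If a shared filesystem is not available, one thing that could be tried in the meantime is to collect each node's copy of the checkpoint directory in one place (for example with rsync or scp) and merge the files, then distribute the merged directory to every node. The sketch below shows only the merge step. It rests on the assumption that making every partition file visible on each node is enough for loading, which has not been confirmed in this thread; merge_partition_files is a hypothetical helper, not a DeepSpeed or accelerate API.

```python
import shutil
from pathlib import Path


def merge_partition_files(node_dirs, merged_dir):
    """Copy every file from each per-node checkpoint copy into merged_dir.

    Files already present in merged_dir are left untouched, so a file that
    exists on several nodes is taken from the first directory listed.
    Returns the sorted names of the files that were copied.
    """
    merged = Path(merged_dir)
    merged.mkdir(parents=True, exist_ok=True)
    copied = []
    for node_dir in node_dirs:
        for src in sorted(Path(node_dir).iterdir()):
            dst = merged / src.name
            if src.is_file() and not dst.exists():
                shutil.copy2(src, dst)
                copied.append(src.name)
    return sorted(copied)
```

Using a shared filesystem remains the workaround recommended above; this merge is just a stopgap to experiment with until the issue is resolved.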