Questions about DeepSpeed resume training

  1. Background: using DeepSpeed (ZeRO-1) with Accelerate for multi-node training; optimizer states are saved so training can be resumed.
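For context, enabling ZeRO stage 1 with bf16 in DeepSpeed comes down to a small config fragment like the one below (a minimal illustrative sketch; batch-size and accumulation values are placeholders, not taken from the original setup):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 1 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```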
  2. I find that `accelerate.save_state` only saves the locally held optimizer partitions on each machine. With 32 GPUs (4 nodes), each node saves only its own 8 partitions; the 1st node, for example, saves `bf16_zero_pp_rank_(0-7)_mp_rank_00_optim_states.pt`. When `load_state` is called to resume DeepSpeed training, the 1st node reports:

```
The following zero checkpoints paths are missing: [
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt',
'./output/test/step_5/pytorch_model/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt']
```

Resuming therefore requires merging all the optimizer states from the different nodes into one place. Is there an easier way to synchronize them?
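To see the problem concretely: DeepSpeed expects one optimizer shard per rank in the checkpoint directory. A quick pre-flight check (a minimal sketch; the file-name pattern is taken from the error above, while the function name and world size are illustrative assumptions) can list which shards a node is missing before calling `load_state`:

```python
import os


def missing_optim_shards(ckpt_dir: str, world_size: int) -> list[str]:
    """Return the per-rank ZeRO-1 optimizer shard files absent from ckpt_dir.

    Assumes the bf16 shard naming from the error message:
    bf16_zero_pp_rank_<rank>_mp_rank_00_optim_states.pt
    """
    expected = [
        f"bf16_zero_pp_rank_{rank}_mp_rank_00_optim_states.pt"
        for rank in range(world_size)
    ]
    return [
        os.path.join(ckpt_dir, name)
        for name in expected
        if not os.path.exists(os.path.join(ckpt_dir, name))
    ]
```

On the setup described above, a node that saved only ranks 0-7 of a 32-rank job would get back the 24 paths for ranks 8-31, matching the error message.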

Hello @cyk1337, we have discussed this with the DeepSpeed team, and they said that the current codebase assumes the user has a shared filesystem.

  1. As a workaround, use a shared filesystem.
  2. We raised an issue on the DeepSpeed repo, as per their recommendation: [BUG] Can’t load checkpoint without having shared filesystem in multi-node training when multi-node setup config remains same · Issue #2319 · microsoft/DeepSpeed (github.com). Please follow up there. Once that gets resolved, you should be able to load the state without a shared filesystem.
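Until that issue is resolved, another stop-gap is to union the shard files from every node's checkpoint directory before resuming, so that `load_state` sees a complete set on each node. The sketch below only simulates this locally with one directory per node; in a real multi-node job each directory lives on a different machine and the copy would be a network transfer (e.g. rsync or scp). The function name and directory layout are illustrative assumptions:

```python
import os
import shutil


def merge_shard_dirs(node_dirs: list[str]) -> None:
    """Copy each optimizer shard into every node directory that lacks it.

    node_dirs simulates one checkpoint directory per node; the shutil
    copy stands in for the cross-node transfer you would actually need.
    """
    # Collect the union of shard files across all node directories.
    all_files: dict[str, str] = {}
    for d in node_dirs:
        for name in os.listdir(d):
            if name.endswith("_optim_states.pt"):
                all_files.setdefault(name, os.path.join(d, name))
    # Fill in whatever each directory is missing.
    for d in node_dirs:
        for name, src in all_files.items():
            dst = os.path.join(d, name)
            if not os.path.exists(dst):
                shutil.copy2(src, dst)
```

After the merge, every directory holds all `*_optim_states.pt` shards, which is exactly the layout the shared-filesystem assumption provides for free.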