To resume DeepSpeed ZeRO-1 multi-node training, I use accelerate.save_state to save the optimizer states. However, each node only saves its own partition of the states (e.g., 1/4 of the states on each of 4 machines). When loading the states to resume training, load_state reports that the checkpoints from the other nodes are missing:

The following zero checkpoints paths are missing: ['./output/test/step_5/pytorch_model/bf16_zero_pp_rank_{8-31}_mp_rank_00_optim_states.pt', ...]

and the subprocesses are killed.
Right now I have to manually synchronize the partitioned states from all nodes before resuming. Is there an easier solution?
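For reference, a minimal sketch of my save/load flow (the model, optimizer, dataloader, and DeepSpeedPlugin setup here are simplified placeholders; in the real run ZeRO-1 is configured through my accelerate/DeepSpeed config and launched with accelerate launch across the 4 nodes):

```python
# Minimal sketch, not the actual training script: placeholder model/optimizer/data,
# DeepSpeed ZeRO-1 assumed to be configured as in my accelerate config.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(16, 16)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters())    # placeholder optimizer
dataloader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training loop ...

# Each rank writes its own ZeRO-partitioned optimizer shard, so without shared
# storage every node only ends up with its local shards (e.g. ranks 0-7 on the
# first node when running 4 nodes x 8 GPUs, matching the error above).
accelerator.save_state("./output/test/step_5")

# On resume, load_state expects all shards to exist under the same path on
# every node, which is where the "zero checkpoints paths are missing" error
# and the killed subprocesses come from.
accelerator.load_state("./output/test/step_5")
```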