Trainer option to disable saving DeepSpeed checkpoints

Well, if you resume from weights only, you waste training resources, since the optimizer will need time to get back to the state it had reached when you stopped. So when you resume training you typically want to restore the optimizer states rather than start them from scratch; that’s the whole point of saving intermediary checkpoints. Shuffling the data shouldn’t make any difference to wanting the ongoing optimizer states.

You don’t need to change any DS code, you just need to set:

stage3_gather_fp16_weights_on_model_save=false

in the ds_config file and it won’t gather and save the fp16 weights. You can then extract the full fp32 weights at the end of your training using the zero_to_fp32.py script.
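Something like this fragment of the ds_config file is what I mean (the rest of your config stays as it is; only the flag shown here is the relevant change, and the other zero3 entries are omitted):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_fp16_weights_on_model_save": false
  }
}
```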

Of course, do a very short training run first, try the zero_to_fp32.py script on it, and make sure that this is indeed what you want.
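For example, a quick test could look roughly like this (a sketch only; the checkpoint path is a placeholder for whatever your test run produced, and the programmatic helper lives in deepspeed.utils.zero_to_fp32):

```python
# Sketch: extract fp32 weights from a ZeRO checkpoint produced by a short test run.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "output_dir/checkpoint-100"  # placeholder: your test-run checkpoint folder
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, "pytorch_model_fp32.bin")

# Equivalently, from the command line (DeepSpeed copies the script into the checkpoint dir):
#   cd output_dir/checkpoint-100
#   python zero_to_fp32.py . pytorch_model.bin
```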

So the proposal is this:

  1. don’t gather any zero3 weights or use additional CPU memory to build a state_dict; just dump the intermediary states to disk really fast (should be ~0 CPU RAM overhead)
  2. resume training from the checkpoint until you have finished your training - i.e. repeat this step as many times as you need to (see the sketch after this list)
  3. extract the final fp32 weights with zero_to_fp32.py
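For step 2, if you’re training with the HF Trainer, resuming would look roughly like this (a sketch; `trainer` is assumed to be your already-built Trainer with deepspeed configured in its TrainingArguments, and the checkpoint path is a placeholder):

```python
# Step 2 sketch: resume from the last saved checkpoint so optimizer/scheduler states are restored.
trainer.train(resume_from_checkpoint=True)

# Or point at a specific checkpoint directory (placeholder path):
# trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```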

If something is unclear, please don’t hesitate to ask for further clarification, @mihai