You can't skip saving the intermediary checkpoints — if you don't save them, how will you resume the training/finetuning without the optimizer states?
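(If you're using the HF Trainer, resuming from such a checkpoint is just a matter of pointing it at the checkpoint directory — a minimal sketch, where the path is only a placeholder:)

```python
# Sketch, assuming `trainer` is an already-constructed transformers.Trainer
# that was set up with your DeepSpeed config. Resuming restores the model,
# optimizer and lr-scheduler states saved in the checkpoint directory;
# "output_dir/checkpoint-500" is a placeholder path.
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```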
In fact, it's stage3_gather_fp16_weights_on_model_save=true that consumes the most RAM, since it uses additional CPU RAM to gather the weights from the multiple GPUs, so you can disable it — especially since, if I understand your setup correctly, you're using only one GPU.
The most efficient process should be to set stage3_gather_fp16_weights_on_model_save=false and save the weights as is, which should not take any additional memory since it's just a GPU → disk copy, and then at the end use zero_to_fp32.py to extract the full fp32 weights. The specifics are here: DeepSpeed Integration
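For illustration only, here is a sketch of what that might look like; the config dict below is a minimal ZeRO-3 fragment (not a complete config) and the checkpoint path is just a placeholder:

```python
# Minimal sketch: disable the fp16 weight gathering in the ZeRO-3 section so
# that saving a checkpoint doesn't pull the full model into CPU RAM. A dict
# like this can also be passed to TrainingArguments(deepspeed=...).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": False,
        # ... the rest of your ZeRO-3 / offload settings ...
    },
    # ... fp16, optimizer, scheduler sections ...
}

# After training, reconstruct the full fp32 state_dict offline from the
# partitioned checkpoint, e.g. with the helper that ships with DeepSpeed
# ("output_dir/checkpoint-500" is a placeholder path):
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("output_dir/checkpoint-500")

# or run the standalone zero_to_fp32.py script that gets dropped into the
# checkpoint folder, along the lines of:
#   python zero_to_fp32.py . pytorch_model.bin
```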
Please let me know whether applying this suggestion avoids the additional CPU RAM usage. And if you have any doc improvements to suggest, I'm all ears.
The DeepSpeed checkpoints are huge (60GB+)
The checkpoint size is roughly params*18 bytes, so your model must be around 3B params. It should be only params*2 bytes larger than a normal non-DS checkpoint, since it saves both the fp16 and fp32 weights; otherwise DS checkpoints aren't much bigger than non-DS checkpoints.
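As a quick back-of-the-envelope check of that estimate (a sketch only; the ~18 bytes/param figure is the rule of thumb above):

```python
# Back-of-the-envelope check: with ~18 bytes/param (fp16 weights + fp32
# weights + fp32 optimizer states), a ~60GB checkpoint corresponds to
# roughly a 3B-param model.
checkpoint_bytes = 60e9        # ~60 GB
bytes_per_param = 18
params = checkpoint_bytes / bytes_per_param
print(f"~{params / 1e9:.1f}B params")   # -> ~3.3B params
```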