You can't skip saving the intermediary checkpoints — if you don't save them, how will you resume the training/finetuning without the optimizer states?
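(If you're using the HF Trainer, resuming from such a checkpoint is just a matter of pointing it at the checkpoint directory — a minimal sketch, where the path is only a placeholder:)

```python
# Sketch, assuming `trainer` is an already-constructed transformers.Trainer
# that was set up with your DeepSpeed config. Resuming restores the model,
# optimizer and lr-scheduler states saved in the checkpoint directory;
# "output_dir/checkpoint-500" is a placeholder path.
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```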
In fact, it's stage3_gather_fp16_weights_on_model_save=true that consumes the most RAM, since it uses additional CPU RAM to gather the weights from the multiple GPUs, so you can disable it — especially since, if I understand your setup correctly, you're using only one GPU.
The most efficient process should be to set stage3_gather_fp16_weights_on_model_save=false and save the weights as is, which should not take any additional memory since it's just a GPU → disk copy, and then at the end use zero_to_fp32.py to extract the full fp32 weights. The specifics are here: DeepSpeed Integration
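For illustration only, here is a sketch of what that might look like; the config dict below is a minimal ZeRO-3 fragment (not a complete config) and the checkpoint path is just a placeholder:

```python
# Minimal sketch: disable the fp16 weight gathering in the ZeRO-3 section so
# that saving a checkpoint doesn't pull the full model into CPU RAM. A dict
# like this can also be passed to TrainingArguments(deepspeed=...).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": False,
        # ... the rest of your ZeRO-3 / offload settings ...
    },
    # ... fp16, optimizer, scheduler sections ...
}

# After training, reconstruct the full fp32 state_dict offline from the
# partitioned checkpoint, e.g. with the helper that ships with DeepSpeed
# ("output_dir/checkpoint-500" is a placeholder path):
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("output_dir/checkpoint-500")

# or run the standalone zero_to_fp32.py script that gets dropped into the
# checkpoint folder, along the lines of:
#   python zero_to_fp32.py . pytorch_model.bin
```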
Please let me know whether applying this suggestion avoids the additional CPU RAM usage. And if you have any doc improvements to suggest, I'm all ears.
The DeepSpeed checkpoints are huge (60GB+)
The checkpoint size is roughly params*18 bytes, so your model must be around 3B params. It should be only params*2 bytes larger than a normal non-DS checkpoint, since it saves both the fp16 and fp32 weights; otherwise DS checkpoints aren't much bigger than non-DS checkpoints.
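As a quick back-of-the-envelope check of that estimate (a sketch only; the ~18 bytes/param figure is the rule of thumb above):

```python
# Back-of-the-envelope check: with ~18 bytes/param (fp16 weights + fp32
# weights + fp32 optimizer states), a ~60GB checkpoint corresponds to
# roughly a 3B-param model.
checkpoint_bytes = 60e9        # ~60 GB
bytes_per_param = 18
params = checkpoint_bytes / bytes_per_param
print(f"~{params / 1e9:.1f}B params")   # -> ~3.3B params
```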