Trainer option to disable saving DeepSpeed checkpoints

How bad is it if I shuffle the training and evaluation datasets and restart the process by loading pytorch_model.bin (generated from the previously stopped fine-tuning process)? Did I mention that I'm a noob here?
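
Concretely, the restart I have in mind is roughly this (a sketch; the model class and output_dir path are placeholders for my actual setup):

```python
from transformers import AutoModelForCausalLM  # placeholder model class

# Rebuild the model directly from the directory holding pytorch_model.bin
# (plus its config.json). Only the weights come back -- the optimizer,
# LR scheduler, and RNG states from the stopped run are not in that file,
# so training resumes "warm" rather than exactly where it left off.
model = AutoModelForCausalLM.from_pretrained("output_dir")
```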

Especially since, if I understand your setup correctly, you're using only one GPU.

Correct, one lonely GPU.

The docs and the DeepSpeed code say that setting stage3_gather_fp16_weights_on_model_save to False means pytorch_model.bin will not be generated.
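
For reference, that flag lives in the zero_optimization section of the DeepSpeed config you hand to the Trainer. A minimal sketch (the "auto" entries and output_dir are placeholders; only the last flag is the one under discussion):

```python
from transformers import TrainingArguments

# Minimal ZeRO-3 config; the Trainer fills in the "auto" values itself.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        # False skips gathering the partitioned fp16 weights at save time,
        # so pytorch_model.bin is not written.
        "stage3_gather_fp16_weights_on_model_save": False,
    },
}

args = TrainingArguments(
    output_dir="output_dir",  # placeholder
    deepspeed=ds_config,      # accepts a dict or a path to a JSON file
)
```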

Let me see if I understand your proposal correctly: if I modify the DeepSpeed code by adding a special case that saves the fp16 weights when the number of GPUs == 1 and stage3_gather_fp16_weights_on_model_save=False, that would result in the most efficient save process for intermediate weights (for one GPU)?
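
If I've got that right, the guard would look something like this? Purely illustrative, not the actual DeepSpeed source; the function and variable names here are hypothetical stand-ins:

```python
import torch.distributed as dist

def maybe_save_consolidated_fp16(engine, save_dir, gather_fp16_on_save):
    # Hypothetical wrapper around the save path, not DeepSpeed source code.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    # Proposed special case: with a single GPU there is no cross-rank
    # gather to pay for, so save the fp16 weights even when
    # stage3_gather_fp16_weights_on_model_save is False.
    if gather_fp16_on_save or world_size == 1:
        # Assumes the DeepSpeed engine's fp16 save method writes
        # pytorch_model.bin to save_dir.
        engine.save_fp16_model(save_dir)
```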