Trainer option to disable saving DeepSpeed checkpoints

How bad is it if I shuffle the training and evaluation datasets and restart the process by loading pytorch_model.bin (generated from the previously stopped fine-tuning process)? Did I mention that I'm a noob here?
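
Concretely, the restart I have in mind is roughly this (a sketch; the model class and output_dir path are placeholders for my actual setup):

```python
from transformers import AutoModelForCausalLM  # placeholder model class

# Rebuild the model directly from the directory holding pytorch_model.bin
# (plus its config.json). Only the weights come back -- the optimizer,
# LR scheduler, and RNG states from the stopped run are not in that file,
# so training resumes "warm" rather than exactly where it left off.
model = AutoModelForCausalLM.from_pretrained("output_dir")
```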

Especially since, if I understand your setup correctly, you're using only one GPU.

Correct, one lonely GPU.

The docs and the DeepSpeed code say that setting stage3_gather_fp16_weights_on_model_save to False means pytorch_model.bin will not be generated.
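
For reference, that flag lives in the zero_optimization section of the DeepSpeed config you hand to the Trainer. A minimal sketch (the "auto" entries and output_dir are placeholders; only the last flag is the one under discussion):

```python
from transformers import TrainingArguments

# Minimal ZeRO-3 config; the Trainer fills in the "auto" values itself.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        # False skips gathering the partitioned fp16 weights at save time,
        # so pytorch_model.bin is not written.
        "stage3_gather_fp16_weights_on_model_save": False,
    },
}

args = TrainingArguments(
    output_dir="output_dir",  # placeholder
    deepspeed=ds_config,      # accepts a dict or a path to a JSON file
)
```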

Let me see if I understand your proposal correctly: if I modify the DeepSpeed code by adding a special case that saves the fp16 weights when the number of GPUs == 1 and stage3_gather_fp16_weights_on_model_save=False, that would result in the most efficient save process for intermediate weights (for one GPU)?
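
If I've got that right, the guard would look something like this? Purely illustrative, not the actual DeepSpeed source; the function and variable names here are hypothetical stand-ins:

```python
import torch.distributed as dist

def maybe_save_consolidated_fp16(engine, save_dir, gather_fp16_on_save):
    # Hypothetical wrapper around the save path, not DeepSpeed source code.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    # Proposed special case: with a single GPU there is no cross-rank
    # gather to pay for, so save the fp16 weights even when
    # stage3_gather_fp16_weights_on_model_save is False.
    if gather_fp16_on_save or world_size == 1:
        # Assumes the DeepSpeed engine's fp16 save method writes
        # pytorch_model.bin to save_dir.
        engine.save_fp16_model(save_dir)
```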