Trainer option to disable saving DeepSpeed checkpoints

@stas many thanks for the advice!

I understand what you are saying. Essentially, the following statement is key for my issue:

> just dump the intermediary states to disk really fast

For a model like gpt-j-6b, that’s 68G for every checkpoint (fp16). I see now that my initial approach of moving that data over the network was naive; a much better idea is to keep the checkpoints on very fast local storage (NVMe) and use a reasonable `save_total_limit` (like 10).
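For anyone landing here later, a minimal sketch of that setup with `TrainingArguments` — the NVMe path, step counts, and DeepSpeed config filename are just placeholders, not values from this thread:

```python
from transformers import TrainingArguments

# Sketch: keep checkpoints on fast local storage and cap how many are retained.
training_args = TrainingArguments(
    output_dir="/nvme/checkpoints",   # fast local NVMe, not a network filesystem (placeholder path)
    save_strategy="steps",
    save_steps=500,                   # how often to dump intermediate states (placeholder)
    save_total_limit=10,              # older checkpoints get deleted as new ones are saved
    fp16=True,
    deepspeed="ds_config.json",       # your existing DeepSpeed config (placeholder name)
)
```

With `save_total_limit=10`, the Trainer rotates checkpoints in `output_dir`, so the disk usage stays bounded at roughly ten checkpoints' worth.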

Again, thank you! I’ll mark this thread as resolved.