Trainer option to disable saving DeepSpeed checkpoints

@stas many thanks for the advice!

I understand what you are saying. Essentially, the following statement is key for my issue:

> just dump the intermediary states to disk really fast

For a model like gpt-j-6b, that’s 68G for every checkpoint (fp16). I see now that my initial approach of moving that data over the network was naive; a much better idea is to keep the checkpoints on very fast local storage (NVMe) and use a reasonable `save_total_limit` (like 10).
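For anyone landing here later, a minimal sketch of that setup with `TrainingArguments` — the NVMe path, step counts, and DeepSpeed config filename are just placeholders, not values from this thread:

```python
from transformers import TrainingArguments

# Sketch: keep checkpoints on fast local storage and cap how many are retained.
training_args = TrainingArguments(
    output_dir="/nvme/checkpoints",   # fast local NVMe, not a network filesystem (placeholder path)
    save_strategy="steps",
    save_steps=500,                   # how often to dump intermediate states (placeholder)
    save_total_limit=10,              # older checkpoints get deleted as new ones are saved
    fp16=True,
    deepspeed="ds_config.json",       # your existing DeepSpeed config (placeholder name)
)
```

With `save_total_limit=10`, the Trainer rotates checkpoints in `output_dir`, so the disk usage stays bounded at roughly ten checkpoints' worth.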

Again, thank you! I’ll mark this thread as resolved.