Trainer option to disable saving DeepSpeed checkpoints

I’d like to ask for opinions about adding a Trainer configuration option to disable saving of DeepSpeed checkpoints (potentially only keeping the model weights).

Context: I’m finetuning gpt-j-6b for basic translation phrases on consumer hardware (128GB of system RAM and an Nvidia GPU with 24GB of VRAM). I use DeepSpeed ZeRO (stages 2 and 3), so 99% of my system memory is fully allocated (I also have a huge swap file on an NVMe drive).

Arguments for adding this flag:

  • The DeepSpeed checkpoints are huge (60GB+) and take a long time to save. Because my system RAM is consumed by the DeepSpeed optimizer, I can’t use local storage (no system RAM left for write buffers), so I have to transfer them over the network to a NAS.
  • The DeepSpeed save_checkpoint code (as of v0.5.8) is very hungry for system RAM and may even have a memory leak: after 10+ attempts, I’ve never been able to save more than three checkpoints before the Linux kernel decided to OOM-kill the finetuning Python process. Out of frustration, I modified the Trainer code by commenting out all calls to deepspeed.save_checkpoint(output_dir), and surprise: I’ve been able to finetune and save the model weights 15+ times without getting OOM-killed.

The default state for this option would obviously be to save everything. I read the previous topic on the subject, Disable checkpointing in Trainer, which is nice but doesn’t exactly cover my use case.

cc @stas who is the most knowledgeable on the DeepSpeed integration.


You can’t simply not save intermediary checkpoints: if you don’t, how will you restart the training/finetuning without the optimizer states?

In fact, it’s stage3_gather_fp16_weights_on_model_save=true that consumes the most RAM, since it uses additional CPU RAM to gather the weights from multiple GPUs, so you can disable it. Especially since, if I understand your setup correctly, you’re using only one GPU.

The most efficient process should be to set stage3_gather_fp16_weights_on_model_save=false, save the weights as is (which should not take any additional memory, since it’s just a GPU → disk copy), and then at the end use zero_to_fp32.py to extract the full fp32 weights. Specifics are here: DeepSpeed Integration
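For reference, a minimal ds_config fragment for this setup might look like the following (a sketch: the stage3_gather_fp16_weights_on_model_save key is the one discussed in this thread; the stage key is the usual ZeRO-3 boilerplate, and your real config will contain more entries):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_fp16_weights_on_model_save": false
  }
}
```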

Please let me know whether applying this suggestion avoids the additional CPU RAM usage. And if you have some doc improvements to suggest, I’m all ears.

The DeepSpeed checkpoints are huge (60GB+)

The checkpoint size is params*18, so your model must be around 3B params. It should be only params*2 larger than a normal non-DS checkpoint, since it saves both fp16 and fp32 weights; otherwise DS checkpoints aren’t much bigger than non-DS checkpoints.
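The params*18 rule above can be sanity-checked with quick arithmetic (a sketch; the 18 bytes/param figure is taken from this thread and covers fp16 weights, fp32 master weights, and optimizer states):

```python
# Rough checkpoint-size arithmetic based on the params*18 rule quoted above.
def ds_checkpoint_gb(n_params: float, bytes_per_param: int = 18) -> float:
    """Approximate DeepSpeed checkpoint size in GB."""
    return n_params * bytes_per_param / 1e9

# A 60GB checkpoint implies roughly 60/18 = ~3.3B parameters:
print(f"{60e9 / 18 / 1e9:.1f}B params")   # 3.3B params
# while a true 6B-param model would checkpoint at about:
print(f"{ds_checkpoint_gb(6e9):.0f} GB")  # 108 GB
```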

How bad would it be if I shuffle the training and evaluation datasets and restart the process by loading pytorch_model.bin (generated from the previously stopped finetuning run)? Did I mention that I’m a noob here?

Especially since if I understand your setup you’re using only one gpu.

Correct, one lonely GPU.

The docs and the DeepSpeed code say that setting stage3_gather_fp16_weights_on_model_save to False will prevent pytorch_model.bin from being generated.

Let me see if I understand your proposal correctly: if I modify the DS code by adding a special case that saves the fp16 weights when the number of GPUs == 1 and stage3_gather_fp16_weights_on_model_save=false, would that result in the most efficient save process for intermediate weights (for 1 GPU)?

Well, if you restart from weights only, you waste training resources, since your optimizer will take time to get back to the point where you stopped. So when you resume training, you typically want to restore the optimizer states rather than start from scratch. That’s the whole point of saving intermediary checkpoints. Shuffling the data shouldn’t make any difference to wanting the ongoing optimizer states.

You don’t need to change any DS code, you just need to set:

stage3_gather_fp16_weights_on_model_save=false

in the ds_config file, and it won’t gather and save the fp16 weights. You can then extract the perfect fp32 weights at the end of your training using the zero_to_fp32.py script.

Of course, first do a very short training run and try the zero_to_fp32.py script, to make sure that this is what you want.
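As a rough sketch of that trial run, the extraction step amounts to invoking the zero_to_fp32.py helper that DeepSpeed drops into each checkpoint directory; the paths below are illustrative assumptions, not from this thread:

```python
from pathlib import Path

def fp32_extract_cmd(checkpoint_dir: str, out_file: str = "pytorch_model.bin") -> str:
    """Build the typical zero_to_fp32.py invocation for a ZeRO checkpoint dir."""
    ckpt = Path(checkpoint_dir)
    # DeepSpeed places zero_to_fp32.py alongside the global_step* state folders.
    return f"python {ckpt / 'zero_to_fp32.py'} {ckpt} {ckpt / out_file}"

print(fp32_extract_cmd("output/checkpoint-500"))
# python output/checkpoint-500/zero_to_fp32.py output/checkpoint-500 output/checkpoint-500/pytorch_model.bin
```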

So the proposal is this:

  1. don’t gather any ZeRO-3 weights or use additional CPU memory to build a state_dict; just dump the intermediary states to disk really fast (should be ~0 CPU RAM overhead)
  2. resume training from checkpoint until you have finished your training - i.e. repeat this step as many times as you need to
  3. extract the final fp32 weights with zero_to_fp32.py

If something is unclear please don’t hesitate to ask for further clarifications, @mihai

@stas many thanks for the advice!

I understand what you are saying. Essentially, the following statement is key for my issue:

just dump the intermediary states to disk really fast

For a model like gpt-j-6b, that’s 68GB for every checkpoint (fp16). I see now that my initial approach (moving that data over the network) was naive; a much better idea is to keep it on very fast local storage (NVMe) and use a reasonable save_total_limit (like 10).
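A quick budget check for that plan (a sketch using the numbers above; save_total_limit is the standard Transformers TrainingArguments option that prunes older checkpoints):

```python
def nvme_budget_gb(checkpoint_gb: float, save_total_limit: int) -> float:
    """Local disk space needed to retain the last N checkpoints."""
    return checkpoint_gb * save_total_limit

print(nvme_budget_gb(68, 10))  # 680.0 -> ~680GB of local NVMe space
```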

Again, thank you! I’ll mark this thread as resolved.

Yes, DeepSpeed’s workflow is designed to be very fast, so ideally you want each node to save its data locally and have the system deal with synchronization after the training has moved on with its work.

For example, for the current 104B GPT training at JeanZay/BigScience, we save 1.4TB checkpoints, which takes about 60 seconds. This is with 256 GPUs on 64 nodes, and each process only writes its data locally to an SSD drive. The system itself uses GPFS over SSD discs to synchronize, but this is transparent to the training software.

And your use case is much simpler since you don’t have multi-node. So yes, local NVMe storage for intermediary checkpoints would be ideal.

But again, practice with a tiny training run (like 5 steps) and make sure you know how to extract the final weights, so that you know it works for your needs.