Trainer option to disable saving DeepSpeed checkpoints

Well, if you resume from weights only, you waste training resources, since the optimizer will need time to get back to the state it had reached when you stopped. So when you resume training you typically want to restore the optimizer states rather than start them from scratch; that’s the whole point of saving intermediary checkpoints. Shuffling the data shouldn’t make any difference to wanting the ongoing optimizer states.

You don’t need to change any DS code, you just need to set:

stage3_gather_fp16_weights_on_model_save=false

in the ds_config file and it won’t gather and save the fp16 weights. You can then extract the full fp32 weights at the end of your training using the zero_to_fp32.py script.
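Something like this fragment of the ds_config file is what I mean (the rest of your config stays as it is; only the flag shown here is the relevant change, and the other zero3 entries are omitted):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_fp16_weights_on_model_save": false
  }
}
```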

Of course, do a very short training run first, try the zero_to_fp32.py script on it, and make sure that this is indeed what you want.
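For example, a quick test could look roughly like this (a sketch only; the checkpoint path is a placeholder for whatever your test run produced, and the programmatic helper lives in deepspeed.utils.zero_to_fp32):

```python
# Sketch: extract fp32 weights from a ZeRO checkpoint produced by a short test run.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "output_dir/checkpoint-100"  # placeholder: your test-run checkpoint folder
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, "pytorch_model_fp32.bin")

# Equivalently, from the command line (DeepSpeed copies the script into the checkpoint dir):
#   cd output_dir/checkpoint-100
#   python zero_to_fp32.py . pytorch_model.bin
```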

So the proposal is this:

  1. don’t gather any zero3 weights or use additional CPU memory to build a state_dict; just dump the intermediary states to disk really fast (should be ~0 CPU RAM overhead)
  2. resume training from the checkpoint until you have finished your training - i.e. repeat this step as many times as you need to (see the sketch after this list)
  3. extract the final fp32 weights with zero_to_fp32.py
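For step 2, if you’re training with the HF Trainer, resuming would look roughly like this (a sketch; `trainer` is assumed to be your already-built Trainer with deepspeed configured in its TrainingArguments, and the checkpoint path is a placeholder):

```python
# Step 2 sketch: resume from the last saved checkpoint so optimizer/scheduler states are restored.
trainer.train(resume_from_checkpoint=True)

# Or point at a specific checkpoint directory (placeholder path):
# trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```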

If something is unclear, please don’t hesitate to ask for further clarification, @mihai