Interrupting run to trigger checkpoint?

After hundreds of runs it finally happened to me: I forgot to update `save_steps` to a reasonable value after changing batch sizes :scream:
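
For context, the culprit is just the `save_steps` knob on the config, which counts optimizer steps rather than examples, so it has to be rescaled whenever the effective batch size changes. Something like this (the numbers and path are illustrative, not my actual config):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="mistral-large-sft",   # hypothetical path
    per_device_train_batch_size=8,    # doubled from 4...
    save_steps=250,                   # ...so this should have been halved to keep the same cadence
)
```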

In this case I’m using SFTTrainer to finetune Mistral Large on an H200.

  • Loss has already plateaued
  • I didn’t implement early stopping :sweat:
  • There is still about half an epoch (7 hours) to go before the first checkpoint is generated; at this rate the model will likely overfit well before then.

Is there any way to salvage this situation? I’m running in a Jupyter notebook and have read that interrupts might be handled gracefully and trigger an immediate checkpoint?
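
For concreteness, this is roughly what I was planning to try after Kernel → Interrupt, assuming the `trainer` object survives the `KeyboardInterrupt` in the notebook namespace (the output path is made up):

```python
# Run after interrupting the kernel; `trainer` should still be alive.
# "manual-checkpoint" is just a hypothetical directory name.
trainer.save_model("manual-checkpoint")  # saves model weights + config
trainer.save_state()                     # saves trainer_state.json (step counters, log history)
```

No idea whether the model is left in a consistent state mid-step, which is why I’m asking before trying it.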

Any other tricks that might let me salvage the run, or am I SOL?
