After hundreds of runs it finally happened to me: I forgot to update save_steps to a reasonable value after changing batch sizes.
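For context on why this bites: save_steps is counted in optimizer steps, so once the effective batch size goes up, the same value covers far more data between checkpoints. A rough sketch of the kind of misconfiguration I mean, with hypothetical numbers (not my exact config):

```python
from transformers import TrainingArguments

# Hypothetical numbers for illustration only: save_steps counts optimizer steps,
# so the same value spans 32x more samples once gradient accumulation goes up.
args = TrainingArguments(
    output_dir="./mistral-large-sft",   # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,     # raised the effective batch size...
    save_strategy="steps",
    save_steps=500,                     # ...but forgot to scale this down
)
# Before the change: a checkpoint every ~500 samples. After: every ~16,000.
```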
In this case I’m using SFTTrainer to finetune Mistral Large on an H200.
- Loss has already plateaued
- I didn’t implement early stopping (what I plan to wire up next time is sketched after this list)
- There is still about half an epoch (roughly 7 hours) left before the first checkpoint is written; at this rate the model will likely overfit well before then.
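For what it’s worth, this is roughly what I intend to set up next time so a plateau ends the run on its own. It’s a sketch with placeholder names (model, train_ds, eval_ds), using the stock EarlyStoppingCallback from transformers:

```python
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

# Sketch only; SFTConfig subclasses TrainingArguments, so these fields also
# work with a plain TrainingArguments on older trl versions.
args = SFTConfig(
    output_dir="./mistral-large-sft",   # placeholder path
    eval_strategy="steps",              # `evaluation_strategy` on older transformers
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,                     # save whenever we evaluate
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                        # placeholder: the model being finetuned
    args=args,
    train_dataset=train_ds,             # placeholder datasets
    eval_dataset=eval_ds,
    # Stop once eval_loss fails to improve for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```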
Is there any way to salvage this situation? I’m running in a Jupyter notebook, and I’ve read that interrupts might be handled gracefully and trigger an immediate checkpoint — is that actually the case?
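In case it helps with an answer, this is roughly what I was planning to try from a fresh cell after interrupting the training cell, assuming the SFTTrainer instance (`trainer`) survives the KeyboardInterrupt with its in-memory weights intact (untested sketch, placeholder path):

```python
# Run in a new cell after interrupting the training cell; assumes `trainer`
# is still alive in the kernel and the weights were not corrupted mid-step.
save_dir = "./manual-checkpoint"  # placeholder path

# Writes the current model weights, config, and tokenizer to save_dir.
trainer.save_model(save_dir)

# Persists the trainer state (global step, log history, etc.) alongside them.
trainer.state.save_to_json(f"{save_dir}/trainer_state.json")
```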
Any other tricks that might let me salvage the run, or am I SOL?