After hundreds of runs it finally happened to me: I forgot to update save_steps to a reasonable value after changing batch sizes.
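For context on why this bites: save_steps is counted in optimizer steps, so once the effective batch size goes up, the same value covers far more data between checkpoints. A rough sketch of the kind of misconfiguration I mean, with hypothetical numbers (not my exact config):

```python
from transformers import TrainingArguments

# Hypothetical numbers for illustration only: save_steps counts optimizer steps,
# so the same value spans 32x more samples once gradient accumulation goes up.
args = TrainingArguments(
    output_dir="./mistral-large-sft",   # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,     # raised the effective batch size...
    save_strategy="steps",
    save_steps=500,                     # ...but forgot to scale this down
)
# Before the change: a checkpoint every ~500 samples. After: every ~16,000.
```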
In this case I’m using SFTTrainer to finetune Mistral Large on an H200.
- Loss has already plateaued
- I didn’t implement early stopping (what I plan to wire up next time is sketched after this list)
- There is still about half an epoch (roughly 7 hours) left before the first checkpoint is written; at this rate the model will likely overfit well before then.
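For what it’s worth, this is roughly what I intend to set up next time so a plateau ends the run on its own. It’s a sketch with placeholder names (model, train_ds, eval_ds), using the stock EarlyStoppingCallback from transformers:

```python
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

# Sketch only; SFTConfig subclasses TrainingArguments, so these fields also
# work with a plain TrainingArguments on older trl versions.
args = SFTConfig(
    output_dir="./mistral-large-sft",   # placeholder path
    eval_strategy="steps",              # `evaluation_strategy` on older transformers
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,                     # save whenever we evaluate
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                        # placeholder: the model being finetuned
    args=args,
    train_dataset=train_ds,             # placeholder datasets
    eval_dataset=eval_ds,
    # Stop once eval_loss fails to improve for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```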
Is there any way to salvage this situation? I’m running in a Jupyter notebook, and I’ve read that interrupts might be handled gracefully and trigger an immediate checkpoint — is that actually the case?
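In case it helps with an answer, this is roughly what I was planning to try from a fresh cell after interrupting the training cell, assuming the SFTTrainer instance (`trainer`) survives the KeyboardInterrupt with its in-memory weights intact (untested sketch, placeholder path):

```python
# Run in a new cell after interrupting the training cell; assumes `trainer`
# is still alive in the kernel and the weights were not corrupted mid-step.
save_dir = "./manual-checkpoint"  # placeholder path

# Writes the current model weights, config, and tokenizer to save_dir.
trainer.save_model(save_dir)

# Persists the trainer state (global step, log history, etc.) alongside them.
trainer.state.save_to_json(f"{save_dir}/trainer_state.json")
```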
Any other tricks that might let me salvage the run, or am I SOL?