Hi everyone,
I’m currently adjusting our training code to cope with preemption on cloud resources. I identified
trainer._save_checkpoint(trainer.model, None)
as a working solution. However, there are two issues:
- obviously it feels bad to use private methods. Is there a better way to do this?
- if there isn’t, using
_save_checkpoint()
is “bugged”: in the edge case of multiple preemptions during the same evaluation, global_step will be incremented every time, although no training step has been taken, just another evaluation attempt. I’m contemplating whether this is worth filing an issue, since it’s a private method.
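For context, this is roughly how I’m wiring the save into a preemption handler. The sketch below uses a stub in place of the real transformers.Trainer (StubTrainer, handle_preemption, and the saved counter are my own illustrative names, not library API); cloud schedulers typically send SIGTERM before reclaiming a node, so the handler fires there:

```python
import os
import signal

class StubTrainer:
    """Stand-in for transformers.Trainer, just to show the wiring."""
    def __init__(self):
        self.model = object()
        self.saved = 0  # counts checkpoint attempts in this sketch

    def _save_checkpoint(self, model, trial):
        # The real Trainer writes a checkpoint-{global_step}/ directory here.
        self.saved += 1

trainer = StubTrainer()

def handle_preemption(signum, frame):
    # On preemption notice, checkpoint immediately.
    trainer._save_checkpoint(trainer.model, None)

signal.signal(signal.SIGTERM, handle_preemption)

# Simulate a preemption notice for demonstration:
os.kill(os.getpid(), signal.SIGTERM)
```

Repeated SIGTERMs during one evaluation would call _save_checkpoint() repeatedly, which is exactly where the global_step increment described above bites.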