Manual Checkpointing (For e.g. Preemption)

Hi everyone,

I’m currently adjusting our training code to be able to cope with preemption on cloud resources. I identified trainer._save_checkpoint(trainer.model, None) as a working solution. However, there are two issues:

  • obviously it feels bad to use private methods. Is there a better way to do this?
  • if there isn’t, using _save_checkpoint() is “bugged” - in the edge case of multiple preemptions during the same evaluation, global_step will be increased every time, although no step has been made, just another try evaluation. I’m contemplating whether this is worth an issue, since it’s a private method.