Hi everyone,
I’m currently adjusting our training code to cope with preemption on cloud resources. I identified
trainer._save_checkpoint(trainer.model, None)
as a working solution. However, there are two issues:
- obviously it feels bad to use private methods. Is there a better way to do this?
- if there isn’t, using
_save_checkpoint()
is “bugged”: in the edge case of multiple preemptions during the same evaluation, global_step will be incremented every time, although no training step has been taken, just another evaluation attempt. I’m contemplating whether this is worth filing an issue, since it’s a private method.
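For context, this is roughly how I’m wiring the save into a preemption handler. The sketch below uses a stub in place of the real transformers.Trainer (StubTrainer, handle_preemption, and the saved counter are my own illustrative names, not library API); cloud schedulers typically send SIGTERM before reclaiming a node, so the handler fires there:

```python
import os
import signal

class StubTrainer:
    """Stand-in for transformers.Trainer, just to show the wiring."""
    def __init__(self):
        self.model = object()
        self.saved = 0  # counts checkpoint attempts in this sketch

    def _save_checkpoint(self, model, trial):
        # The real Trainer writes a checkpoint-{global_step}/ directory here.
        self.saved += 1

trainer = StubTrainer()

def handle_preemption(signum, frame):
    # On preemption notice, checkpoint immediately.
    trainer._save_checkpoint(trainer.model, None)

signal.signal(signal.SIGTERM, handle_preemption)

# Simulate a preemption notice for demonstration:
os.kill(os.getpid(), signal.SIGTERM)
```

Repeated SIGTERMs during one evaluation would call _save_checkpoint() repeatedly, which is exactly where the global_step increment described above bites.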