When using the Hugging Face Trainer, I would like to save a checkpoint only if my objective metric has improved.
Currently, I am using `eval_steps=100`, `save_steps=100`, `save_total_limit=1`, and `load_best_model_at_end=True`, which means that every 100 steps the latest checkpoint is written and the previous checkpoint is then deleted unless it is the best one.
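For reference, here is roughly what my setup looks like (a sketch only: the output directory, metric name, and comparison direction are placeholders for whatever my actual run uses):

```python
from transformers import TrainingArguments

# Roughly my current configuration; every evaluation step also triggers a save.
training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", # placeholder metric
    greater_is_better=False,
)
```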
This has caused approximately 2TB of wear on my SSD in only a few days due to the excessive checkpointing. I really don’t need to resume from the latest checkpoint; I just need the best checkpoint to be saved. Since I’m not concerned about the run crashing, there is really no need to save every 100 steps.
Additionally, it is not feasible to wait until the end of the run and load the best state, because I stop my runs early by hand, and I do not wish to automate the early stopping either.
I’m happy to monkey patch my build of transformers if anyone is aware of the culprit lines I can comment out or modify.
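For concreteness, something along these lines is the behavior I’m hoping for: turn off the built-in saving and only request a checkpoint from a callback when the metric improves. This is an untested sketch that assumes setting `control.should_save` inside `on_evaluate` makes the Trainer write a checkpoint right after evaluation; the metric name and comparison direction are placeholders:

```python
from transformers import TrainerCallback

class SaveOnImprovementCallback(TrainerCallback):
    """Request a checkpoint only when the monitored metric improves.

    Meant to be used with save_strategy="no" so the Trainer never
    saves on its own schedule.
    """

    def __init__(self, metric_name="eval_loss", greater_is_better=False):
        self.metric_name = metric_name
        self.greater_is_better = greater_is_better
        self.best = None

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics is None or self.metric_name not in metrics:
            return control
        value = metrics[self.metric_name]
        improved = self.best is None or (
            value > self.best if self.greater_is_better else value < self.best
        )
        if improved:
            self.best = value
            control.should_save = True  # checkpoint only on improvement
        return control
```

I imagine it would be passed as `callbacks=[SaveOnImprovementCallback()]` when constructing the `Trainer`, with `save_strategy="no"` and, I believe, `load_best_model_at_end` dropped (since that option requires the eval and save strategies to match). But if there is a simpler fix inside transformers itself, I’d rather know which lines to change.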