When using the Hugging Face Trainer, I would like to save a checkpoint only if my objective metric has improved.
Currently, I am using `eval_steps=100`, `save_steps=100`, `save_total_limit=1`, and `load_best_model_at_end=True`, which means that every 100 steps the latest checkpoint is written and the previous checkpoint is then deleted unless it is the best one.
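For reference, here is roughly what my setup looks like (a sketch only: the output directory, metric name, and comparison direction are placeholders for whatever my actual run uses):

```python
from transformers import TrainingArguments

# Roughly my current configuration; every evaluation step also triggers a save.
training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", # placeholder metric
    greater_is_better=False,
)
```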
This has caused approximately 2TB of wear on my SSD in only a few days due to the excessive checkpointing. I really don’t need to resume from the latest checkpoint; I just need the best checkpoint to be saved. Since I’m not concerned about the run crashing, there is really no need to save every 100 steps.
Additionally, it is not feasible to wait until the end of the run and load the best state, because I stop my runs early by hand, and I do not wish to automate the early stopping either.
I’m happy to monkey patch my build of transformers if anyone is aware of the culprit lines I can comment out or modify.
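For concreteness, something along these lines is the behavior I’m hoping for: turn off the built-in saving and only request a checkpoint from a callback when the metric improves. This is an untested sketch that assumes setting `control.should_save` inside `on_evaluate` makes the Trainer write a checkpoint right after evaluation; the metric name and comparison direction are placeholders:

```python
from transformers import TrainerCallback

class SaveOnImprovementCallback(TrainerCallback):
    """Request a checkpoint only when the monitored metric improves.

    Meant to be used with save_strategy="no" so the Trainer never
    saves on its own schedule.
    """

    def __init__(self, metric_name="eval_loss", greater_is_better=False):
        self.metric_name = metric_name
        self.greater_is_better = greater_is_better
        self.best = None

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics is None or self.metric_name not in metrics:
            return control
        value = metrics[self.metric_name]
        improved = self.best is None or (
            value > self.best if self.greater_is_better else value < self.best
        )
        if improved:
            self.best = value
            control.should_save = True  # checkpoint only on improvement
        return control
```

I imagine it would be passed as `callbacks=[SaveOnImprovementCallback()]` when constructing the `Trainer`, with `save_strategy="no"` and, I believe, `load_best_model_at_end` dropped (since that option requires the eval and save strategies to match). But if there is a simpler fix inside transformers itself, I’d rather know which lines to change.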