Checkpoints and disk storage

I am fine-tuning on a custom dataset using this link.
It is saving too many checkpoints, which fills up my disk. Any suggestions?


If it’s the part using Trainer, you can use the argument save_total_limit (in TrainingArguments) to limit the number of checkpoints kept.
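For reference, a minimal sketch of what that looks like (the `output_dir` and step counts here are placeholders, not values from your setup):

```python
from transformers import TrainingArguments

# Keep at most 2 checkpoints on disk; older ones are deleted as new
# ones are saved, so disk usage stays bounded.
training_args = TrainingArguments(
    output_dir="./results",   # placeholder path
    save_steps=500,           # how often a checkpoint is written
    save_total_limit=2,       # max number of checkpoints kept
)
```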


Thanks, indeed it’s very useful for me.
One more question for clarification: once the checkpoints are saved, should I use the last checkpoint when loading the model later, or is my understanding incorrect?
Also, how many checkpoints should we keep? Can we keep just one?

There is a load_best_model_at_end argument too, which will automatically load your best model (according to a metric you choose). It’s all in the docs :wink:
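Combined with the previous answer, that could look like the sketch below (metric name, paths, and step counts are placeholders; the metric must be one your `compute_metrics` function returns, and evaluation must be enabled so there are scores to compare):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path
    evaluation_strategy="steps",     # evaluate periodically so "best" is defined
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,              # bound disk usage
    load_best_model_at_end=True,     # reload the best checkpoint when training ends
    metric_for_best_model="accuracy",  # placeholder: a key from compute_metrics
    greater_is_better=True,          # higher accuracy = better
)
```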


@sgugger How “smart” is this feature? I remember that in OpenNMT, when you specify a maximum number of checkpoints, it does not take the best evaluated checkpoint up to that point into account. In other words, if an old checkpoint is the best one, it can still get deleted. That should not be the case, so I am wondering whether the implementation in the transformers Trainer is a bit smarter.

Thanks for the information, I am looking into the docs.

Thanks, it solved my issue, especially load_best_model_at_end.

However, for load_best_model_at_end, does the “best model” refer to one of the saved checkpoints? For example, I am using DistilBERT; is the best model here just the best checkpoint?

It is smarter: the best checkpoint is always put at the top of the list of checkpoints, so it never gets deleted (or if it does, it’s a bug :wink: ).
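To illustrate the idea (this is a simplified sketch of the rotation behavior described above, not the actual transformers implementation; the function and checkpoint names are made up):

```python
def select_survivors(checkpoints, best, limit):
    """Return which checkpoints to keep, never dropping the best one.

    Sketch of the rotation idea: sort checkpoints oldest-first by step
    number, move the best checkpoint out of the "oldest" positions, then
    delete from the oldest end until at most `limit` remain.
    """
    ordered = sorted(checkpoints, key=lambda name: int(name.rsplit("-", 1)[1]))
    if best in ordered:
        # Protect the best checkpoint by treating it as the newest entry.
        ordered.remove(best)
        ordered.append(best)
    # Keep only the `limit` entries at the "newest" end of the list.
    return ordered[-limit:] if limit else ordered

kept = select_survivors(
    ["checkpoint-100", "checkpoint-200", "checkpoint-300", "checkpoint-400"],
    best="checkpoint-100",
    limit=2,
)
# checkpoint-100 survives even though it is the oldest, because it is the best.
```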


Love it. Thanks!
