Question regarding trainer arguments: load_best_model_at_end

My question is about the transformers.TrainingArguments class. There are two parameters:

  • save_total_limit
  • load_best_model_at_end

Q1. Let’s say I have set save_total_limit=50, but the best model according to the metric isn’t among the last 50 checkpoints; maybe it’s somewhere in the last 200.
Will load_best_model_at_end select the best model from the last 50 checkpoints, or from the entire training run?
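
For reference, a minimal combination of the two arguments might look like this (output_dir, the step sizes, and the metric name are placeholder choices, not anything prescribed):

```python
from transformers import TrainingArguments

# Minimal sketch; output_dir, step sizes, and the metric are placeholders.
args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",   # load_best_model_at_end needs matching
    eval_steps=500,                # evaluation and save schedules
    save_steps=500,
    save_total_limit=50,           # keep at most 50 checkpoints on disk
    load_best_model_at_end=True,   # reload the best checkpoint after training
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```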

Q2. The problem is that we don’t always have a large SSD (or even regular storage) to train on, so save_total_limit is a limited feature that depends on an individual’s disk space. Applying save_total_limit to the best checkpoints rather than the most recent ones would be a great feature. That way, you could even ensemble multiple checkpoints (which may be good for generation tasks).

So is there any way to save the “best 5 checkpoints” (or best X) from the entire training run?

Note: I tried to read the source code, but there are too many callback functions to deal with. It would save a lot of time if someone could help.


When you use load_best_model_at_end in combination with save_total_limit, the checkpoint with the best metric is never deleted (it’s always put first in the list of all checkpoints).

There is no way for now to keep the 5 best checkpoints, only the best one.
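
One could, however, approximate keeping the best K checkpoints with a custom TrainerCallback. The sketch below is not part of transformers; the class name, its defaults, and the pruning logic are my own, and it assumes evaluation and saving happen at the same steps:

```python
import os
import shutil

from transformers import TrainerCallback


class KeepBestKCheckpoints(TrainerCallback):
    """Prune checkpoints so only the K best (by a chosen metric) survive.

    Rough sketch: assumes evaluation and saving happen at the same steps,
    and that save_total_limit is left unset so the Trainer's own rotation
    doesn't delete checkpoints behind this callback's back.
    """

    def __init__(self, k=5, metric_name="eval_loss", greater_is_better=False):
        self.k = k
        self.metric_name = metric_name
        self.greater_is_better = greater_is_better
        self.scores = {}  # checkpoint path -> metric value

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Record the metric for the checkpoint about to be saved at this step
        # (evaluation runs just before the checkpoint is written).
        if metrics and self.metric_name in metrics:
            path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
            self.scores[path] = metrics[self.metric_name]

    def on_save(self, args, state, control, **kwargs):
        # The checkpoint for this step now exists on disk; rank everything
        # scored so far (best first) and delete whatever falls outside the
        # top K.
        ranked = sorted(self.scores, key=self.scores.get,
                        reverse=self.greater_is_better)
        for path in ranked[self.k:]:
            if os.path.isdir(path):
                shutil.rmtree(path)
            del self.scores[path]
```

You would then pass it via `Trainer(..., callbacks=[KeepBestKCheckpoints(k=5)])` and leave `save_total_limit` unset, so the Trainer’s own rotation doesn’t delete checkpoints the callback wants to keep.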


Hi @sgugger! Thanks a lot for your response. It is totally clear now.