Question regarding trainer arguments: load_best_model_at_end

My question is about the transformers.TrainingArguments class. There are two parameters:

  • save_total_limit
  • load_best_model_at_end

Q1. Let’s say I have set save_total_limit=50, but the best model according to the metric isn’t among the last 50 checkpoints; maybe it’s somewhere in the last 200.
Will load_best_model_at_end select the best model from the last 50 checkpoints, or from the entire training run?
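
For reference, a minimal combination of the two arguments might look like this (output_dir, the step sizes, and the metric name are placeholder choices, not anything prescribed):

```python
from transformers import TrainingArguments

# Minimal sketch; output_dir, step sizes, and the metric are placeholders.
args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",   # load_best_model_at_end needs matching
    eval_steps=500,                # evaluation and save schedules
    save_steps=500,
    save_total_limit=50,           # keep at most 50 checkpoints on disk
    load_best_model_at_end=True,   # reload the best checkpoint after training
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```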

Q2. The problem is that we don’t always have a large SSD (or even regular storage) to train on, so save_total_limit is a limited feature that depends on an individual’s disk space. Applying save_total_limit to the best checkpoints rather than the most recent ones would be a great feature. That way, you could even ensemble multiple checkpoints (which may be good for generation tasks).

So is there any way to save the “best 5 checkpoints” (or best X) from the entire training run?

Note: I tried to read the source code, but there are too many callback functions to deal with. It would save a lot of time if someone could help.


When you use load_best_model_at_end in combination with save_total_limit, the checkpoint with the best metric is never deleted (it’s always put first in the list of all checkpoints).

There is no way for now to keep the 5 best checkpoints, only the best one.
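
One could, however, approximate keeping the best K checkpoints with a custom TrainerCallback. The sketch below is not part of transformers; the class name, its defaults, and the pruning logic are my own, and it assumes evaluation and saving happen at the same steps:

```python
import os
import shutil

from transformers import TrainerCallback


class KeepBestKCheckpoints(TrainerCallback):
    """Prune checkpoints so only the K best (by a chosen metric) survive.

    Rough sketch: assumes evaluation and saving happen at the same steps,
    and that save_total_limit is left unset so the Trainer's own rotation
    doesn't delete checkpoints behind this callback's back.
    """

    def __init__(self, k=5, metric_name="eval_loss", greater_is_better=False):
        self.k = k
        self.metric_name = metric_name
        self.greater_is_better = greater_is_better
        self.scores = {}  # checkpoint path -> metric value

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Record the metric for the checkpoint about to be saved at this step
        # (evaluation runs just before the checkpoint is written).
        if metrics and self.metric_name in metrics:
            path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
            self.scores[path] = metrics[self.metric_name]

    def on_save(self, args, state, control, **kwargs):
        # The checkpoint for this step now exists on disk; rank everything
        # scored so far (best first) and delete whatever falls outside the
        # top K.
        ranked = sorted(self.scores, key=self.scores.get,
                        reverse=self.greater_is_better)
        for path in ranked[self.k:]:
            if os.path.isdir(path):
                shutil.rmtree(path)
            del self.scores[path]
```

You would then pass it via `Trainer(..., callbacks=[KeepBestKCheckpoints(k=5)])` and leave `save_total_limit` unset, so the Trainer’s own rotation doesn’t delete checkpoints the callback wants to keep.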


Hi @sgugger! Thanks a lot for your response. It is totally clear now.