My question is about the
transformers.TrainingArguments class. There are two parameters involved:
Q1. Let’s say I have set
save_total_limit=50, but the best model found by the metric doesn’t stay within the last 50 checkpoints; maybe it was among the last 200 checkpoints. Will
load_best_model_at_end select the best model from the last 50 checkpoints only, or from the entire training run?
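For reference, the setup I mean in Q1 looks roughly like this (the values and the choice of metric are just for illustration):

```python
from transformers import TrainingArguments

# Illustrative sketch; metric_for_best_model is assumed to be eval loss here.
args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=50,           # keep only the 50 most recent checkpoints
    load_best_model_at_end=True,   # reload the best checkpoint at the end
    metric_for_best_model="loss",
    greater_is_better=False,
)
```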
Q2. The problem here is that we don’t always have a large SSD (or even regular storage) to train the model, so
save_total_limit is a feature constrained by an individual’s disk space. On the other hand, applying
save_total_limit to the best checkpoints would be a great feature. That way you could even build an ensemble of multiple checkpoints (which may be good for generation tasks).
So, is there any way to save the “best 5 checkpoints” (or best X) from the entire training run?
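To make the request concrete, the selection logic I have in mind would look something like this. This is a hypothetical sketch, not an existing transformers feature; `checkpoints_to_delete` and the metric bookkeeping are names I made up, and a real implementation would need a TrainerCallback to track each checkpoint’s eval metric and delete the pruned directories.

```python
def checkpoints_to_delete(metric_by_checkpoint, k, greater_is_better=True):
    """Given a mapping {checkpoint_path: eval_metric}, return the paths
    that fall outside the best-k checkpoints by metric."""
    ranked = sorted(
        metric_by_checkpoint.items(),
        key=lambda item: item[1],
        reverse=greater_is_better,  # best first
    )
    # Everything past the first k entries would be pruned from disk.
    return [path for path, _ in ranked[k:]]
```

A callback could call this after every evaluation, so disk usage stays bounded by k checkpoints regardless of how long training runs.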
Note: I tried to read the source code, but there are too many callback functions to deal with. It would save a lot of time if someone could help.