Why should save_steps be a round multiple of eval_steps when load_best_model_at_end=True?

I am confused about these three arguments. As explained here by @sgugger, save_steps doesn’t care about the best model. So if I set eval_steps=100 and save_steps=200, a checkpoint is saved every 200 steps (200, 400, 600, …), but the model is evaluated every 100 steps (100, 200, 300, …). Now, if the evaluation at step 300 is the best, that model is not saved and is lost.
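To make the two schedules concrete, here is a small pure-Python sketch. It is not Trainer code: `schedule` is a hypothetical helper, and the total of 600 training steps is an assumed value chosen only for illustration.

```python
def schedule(interval, max_steps):
    """Return the steps at which an event with the given interval fires."""
    return list(range(interval, max_steps + 1, interval))

eval_steps, save_steps = 100, 200
max_steps = 600  # assumed total number of training steps

evaluated = schedule(eval_steps, max_steps)
saved = schedule(save_steps, max_steps)

# Steps that were evaluated but have no checkpoint on disk
unsaved = [step for step in evaluated if step not in saved]

print("evaluated:", evaluated)  # [100, 200, 300, 400, 500, 600]
print("saved:    ", saved)      # [200, 400, 600]
print("unsaved:  ", unsaved)    # [100, 300, 500]
```

Step 300 appears in `unsaved`, which is the case described above: it is evaluated, but no checkpoint exists for it.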

But if we set load_best_model_at_end=True and keep eval_steps=100 and save_steps=200, eval_steps will override save_steps: a checkpoint is saved every 100 steps so that the best model can be loaded at the end.

Here is the question: if all I said is true, why should save_steps be a round multiple of eval_steps when load_best_model_at_end=True is set? It doesn’t make sense, because when load_best_model_at_end is True, the model doesn’t care about save_steps and saves every eval_steps.

the model doesn’t care about save_steps and saves every eval_steps.

No, that’s not true anymore. The model is saved every save_steps, and each of those steps needs to be a step where evaluation happens so it can keep track of the metrics (the answer you link to is a bit old :wink:).
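As a rough illustration of that constraint (a sketch only, not the Trainer's actual validation; `saves_align_with_evals` is a hypothetical helper and 1000 is an assumed number of training steps), every save lands on an evaluation step exactly when save_steps is a multiple of eval_steps:

```python
def saves_align_with_evals(eval_steps, save_steps, max_steps):
    """Return True if every checkpoint step is also an evaluation step."""
    evaluated = set(range(eval_steps, max_steps + 1, eval_steps))
    saved = range(save_steps, max_steps + 1, save_steps)
    return all(step in evaluated for step in saved)

# save_steps=200 is a multiple of eval_steps=100: every checkpoint has metrics.
print(saves_align_with_evals(100, 200, 1000))  # True

# save_steps=150 is not: step 150 would be saved without being evaluated.
print(saves_align_with_evals(100, 150, 1000))  # False
```

With save_steps=150, the checkpoint at step 150 would have no evaluation metrics attached, so it could not be compared against the others when choosing the best model.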


I am actually experiencing this part:

I can send you a screenshot of the save folder: save_steps is 200, but checkpoints are written based on eval_steps (every 100 steps).

You are entirely right! It’s a bug in the Trainer which should be fixed by this PR.
