Why should save_steps be a round multiple of eval_steps when load_best_model_at_end=True?

SMMousavi · October 17, 2021, 9:28am

I am confused about these three arguments. As explained here by @sgugger, save_steps doesn’t care about the best model. So if I set eval_steps=100 and save_steps=200, there is a checkpoint every 200 steps (200, 400, 600, …) but an evaluation every 100 steps (100, 200, 300, …). If the evaluation at step 300 is the best, that model is not saved and is lost.

But if we set load_best_model_at_end=True and keep eval_steps=100 and save_steps=200, eval_steps will override save_steps: the Trainer will save a checkpoint every 100 steps so that it can load the best model at the end.

Here is the question: if all I said is true, why must save_steps be a round multiple of eval_steps when load_best_model_at_end=True? It doesn’t make sense, because when load_best_model_at_end is True, the model doesn’t care about save_steps and saves every eval_steps.
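To make the scenario in the question concrete, here is a minimal pure-Python sketch of two independent intervals. It is not Trainer code: the helper name `schedule` and the value `max_steps=600` are assumptions chosen for illustration only.

```python
def schedule(max_steps, eval_steps, save_steps):
    """Return (evaluated, saved) step lists when the two intervals are independent."""
    evaluated = [s for s in range(1, max_steps + 1) if s % eval_steps == 0]
    saved = [s for s in range(1, max_steps + 1) if s % save_steps == 0]
    return evaluated, saved


# Hypothetical run length; the intervals are the ones from the question.
evaluated, saved = schedule(max_steps=600, eval_steps=100, save_steps=200)

# Steps that were evaluated but have no checkpoint on disk.
unsaved_evals = [s for s in evaluated if s not in saved]
print(unsaved_evals)  # [100, 300, 500]
```

If the best metric happens to occur at one of the steps in `unsaved_evals`, there is no checkpoint to load for it.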

sgugger · October 18, 2021, 12:56am

the model doesn’t care about save_steps and saves every eval_steps.

No, that’s not true anymore. The model is saved every save_steps, and each save must fall on a step where evaluation happens so the Trainer can keep track of the metrics (the answer you linked to is a bit old).
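The constraint described in this reply can be sketched as a small check. This is a hypothetical helper written for illustration, not the library's actual validation code; the name `validate_steps` and the error wording are assumptions.

```python
def validate_steps(save_steps, eval_steps, load_best_model_at_end):
    """Raise if a checkpoint could be written at a step that has no evaluation metrics."""
    if load_best_model_at_end and save_steps % eval_steps != 0:
        raise ValueError(
            f"save_steps={save_steps} is not a round multiple of eval_steps={eval_steps}"
        )


# 200 is a multiple of 100, so every saved step (200, 400, ...) is also evaluated.
validate_steps(save_steps=200, eval_steps=100, load_best_model_at_end=True)

# 150 is not a multiple of 100: step 150 would be saved without any metrics.
try:
    validate_steps(save_steps=150, eval_steps=100, load_best_model_at_end=True)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

When save_steps is a multiple of eval_steps, every checkpoint has a metric attached, so the best one among the saved checkpoints can be identified at the end of training.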

SMMousavi · October 18, 2021, 10:22am

I am actually experiencing this part:

I can send you a screenshot of the save folder: save_steps is 200, but checkpoints are written according to eval_steps (every 100 steps).

sgugger · October 18, 2021, 1:46pm

You are entirely right! It’s a bug in the Trainer which should be fixed by this PR.
