I’m not sure I understand the meaning of “step” in run_speech_recognition_seq2seq (Trainer).
what is the meaning of a step? what is its relation to epochs?
when save_total_limit = 5,
does it mean that the best 5 steps (by metric) are always saved? or
the last 5 steps are saved (and they may not be the ones with the best metric)?
what is the meaning of a step? what is its relation to epochs?
A “step” (also called a “training step” or “optimization step”) is a single forward pass + backward pass through the model, followed by a parameter update. The model takes in a batch of examples, computes the loss, computes the gradients in the backward pass, and then the optimizer updates the model’s parameters. All of that happens during a single training step.
The relationship to an epoch is that an epoch is one pass through the full training set (the model has seen every training example once). Suppose you have 8000 training examples and a batch size of 8. One epoch will then consist of 1000 steps.
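To make the arithmetic above concrete, here is a tiny sketch (the function name is mine, not part of the Trainer API) that computes steps per epoch from dataset size and batch size:

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # One step consumes one batch; the final batch may be smaller,
    # so we round up.
    return math.ceil(num_examples / batch_size)

# The example from the answer: 8000 examples, batch size 8.
print(steps_per_epoch(8000, 8))  # 1000
```

Note that if you also use gradient accumulation, several batches feed one optimizer update, so the number of optimization steps per epoch shrinks accordingly.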
when save_total_limit = 5,
does it mean that the best 5 steps (by metric) are always saved? or
the last 5 steps are saved (and they may not be the ones with the best metric)?
Quick comment on terminology - the word you’re looking for here is “checkpoints”, not “steps”. save_total_limit limits how many saved checkpoints are kept on disk, not how many steps are run.
To answer the question, there’s documentation for this here. They say:
When load_best_model_at_end is enabled, the “best” checkpoint according to metric_for_best_model will always be retained in addition to the most recent ones. For example, for save_total_limit=5 and load_best_model_at_end, the four last checkpoints will always be retained alongside the best model.
So you’ll want to set load_best_model_at_end=True in the TrainingArguments, and HF will keep the best checkpoint along with the most recent ones.
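For reference, a minimal sketch of what that looks like in TrainingArguments — the output_dir, save_steps value, and metric name are illustrative placeholders, not something from your setup (note that load_best_model_at_end requires the evaluation and save strategies to match):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./whisper-finetuned",   # hypothetical path
    evaluation_strategy="steps",        # must match save_strategy below
    save_strategy="steps",
    save_steps=500,                     # checkpoint every 500 steps
    save_total_limit=5,                 # keep at most 5 checkpoints on disk
    load_best_model_at_end=True,        # always retain the best checkpoint too
    metric_for_best_model="wer",        # e.g. word error rate for ASR
    greater_is_better=False,            # lower WER is better
)
```

With this setup, per the docs quoted above, the four most recent checkpoints are kept alongside the best-scoring one.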