When trying to use EarlyStopping with Seq2SeqTrainer, e.g. with patience set to 2 and threshold 1.0:
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to=None,
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)
early_stop = EarlyStoppingCallback(2, 1.0)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=valid_data.with_format("torch"),
    eval_dataset=test_data.with_format("torch"),
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)
trainer.train()
The model continues training until max_steps instead of stopping once the stopping criterion is met.
I'm not sure if this is a bug or whether some argument is missing when I use Seq2SeqTrainer; working code to replicate the issue can be found at Huggingface EarlyStopping Callbacks | Kaggle
After max_steps, some probing shows that the early_stopping_patience_counter has been reached, yet training didn't stop:
>>> early_stop.early_stopping_patience_counter
2
Found the issue: when save_steps is out of sync with eval_steps, the patience and threshold for early stopping work differently, and the patience only kicks in once the first save step is reached.
This might be a feature / bug.
Setting eval_steps to the same value as save_steps restores the early stopping behavior we would intuitively expect.
Details at python - Why did the Seq2SeqTrainer not stop when the EarlyStoppingCallback criteria is met? - Stack Overflow
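For illustration, here is a minimal sketch of the corrected arguments, assuming the same model/data setup as above, with save_steps and eval_steps aligned at 5 (the EarlyStoppingCallback keyword names are written out explicitly):

from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,                # checkpoint every 5 steps...
    eval_steps=5,                # ...and evaluate on the same schedule
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)
early_stop = EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=1.0)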
The problem here is that save_steps and eval_steps are not equal. The model is set to evaluate every step, and the results are printed at a logging_steps of 1, but even if the model is set to stop after some patience, it cannot stop until it reaches the next save step, which you set to 5.
I faced this issue recently and I think this comment may help.
The documentation states that you have to set save_steps to the same value as eval_steps in your training arguments. I set my model to save at the number of steps corresponding to the end of every epoch with save_steps=len(dataset)//batch_size, which gives the number of steps per epoch. This number has to be the same as eval_steps, which is the number of steps after which an evaluation on the evaluation dataset happens. That way the early stopping kicks in at the right time. If not, the model keeps running until it gets to the next save_steps point, and it continues running if no evaluation takes place at that point. Docs link here: Callbacks
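As a rough sketch of that suggestion (dataset and batch_size are placeholder names here, not variables from the original post):

from transformers import Seq2SeqTrainingArguments

batch_size = 4
steps_per_epoch = len(dataset) // batch_size   # dataset is a placeholder for your training set

training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=batch_size,
    evaluation_strategy="steps",
    eval_steps=steps_per_epoch,   # evaluate at the end of each epoch...
    save_steps=steps_per_epoch,   # ...and save at exactly the same points
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)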