[Maybe Bug] When using EarlyStoppingCallback with Seq2SeqTrainer, training doesn't stop

When trying to use EarlyStoppingCallback with Seq2SeqTrainer, e.g. with patience set to 2 and threshold to 1.0:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to=None,
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)

early_stop = EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=1.0)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=valid_data.with_format("torch"),
    eval_dataset=test_data.with_format("torch"),
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()

The model keeps training until max_steps instead of stopping once the early-stopping criterion is met.

I’m not sure if this is a bug or whether some argument is missing when I use Seq2SeqTrainer. A working notebook to replicate the issue can be found at Huggingface EarlyStopping Callbacks | Kaggle.


After max_steps is reached, some probing shows that the early_stopping_patience_counter did hit the patience limit, yet training didn’t stop:

>>> early_stop.early_stopping_patience_counter
2
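
A few other fields can be probed the same way to confirm what actually happened; a minimal sketch, assuming the trainer and callback from the example above (these are public attributes of transformers’ TrainerState and EarlyStoppingCallback):

print(early_stop.early_stopping_patience_counter)  # patience counter kept by the callback
print(trainer.state.global_step)                   # how many steps training actually ran
print(trainer.state.best_metric)                   # best chr_f_score seen so far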

Also asked on python - Why did the Seq2SeqTrainer not stop when the EarlyStoppingCallback criteria is met? - Stack Overflow

Found the issue: when save_steps is not in sync with eval_steps, the patience and threshold for early stopping work differently, and the patience only kicks in after the first save step is reached.

This might be a feature / bug.

Setting eval_steps to the same value as save_steps restores the early-stopping behavior we intuitively expect.
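
For reference, a minimal sketch of what the adjusted arguments could look like (same illustrative values as in the repro above, with eval_steps and save_steps kept equal so the early-stopping check and the save step line up):

training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    max_steps=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    evaluation_strategy="steps",
    eval_steps=5,   # evaluate every 5 steps ...
    save_steps=5,   # ... and save at the same steps, so early stopping can take effect
    predict_with_generate=True,
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True,
)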

Details on python - Why did the Seq2SeqTrainer not stop when the EarlyStoppingCallback criteria is met? - Stack Overflow

The problem here is that save_steps and eval_steps are not equal. The model evaluates every step and prints results at logging_steps of 1, but even when the patience criterion is met, it cannot stop until it reaches the next save step, which you set to 5.

I faced this issue recently and I think this comment may help.

The documentation states that you have to set save_steps to the same value as eval_steps in your training arguments. I set my model to save at the end of every epoch with save_steps=len(dataset)//batch_size, which gives the number of steps per epoch. This number has to be the same as eval_steps, which is the number of steps between evaluations on the evaluation dataset. That way early stopping kicks in at the right time. Otherwise the model keeps running until the next save_steps point, and continues past it if no evaluation takes place there. Docs link here: Callbacks
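
A rough sketch of that recipe, assuming dataset, batch_size, and the rest of the setup from the original post are already defined:

steps_per_epoch = len(dataset) // batch_size  # optimizer steps per epoch

training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=steps_per_epoch,   # evaluate at the end of every epoch ...
    save_steps=steps_per_epoch,   # ... and save at exactly the same point
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True,
)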
