Problem with EarlyStoppingCallback

I set the early stopping callback in my trainer as follows:

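from transformers import EarlyStoppingCallback

trainer = MyTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    # patience of 3 evaluations, improvement threshold of 0.0
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.0)],
)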

The values relevant to this callback in the TrainingArguments are as follows:

load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,

I expect training to continue as long as eval_loss keeps dropping, and to stop only once eval_loss has not dropped for 3 consecutive epochs, at which point the best model should be loaded.
During training I get these values for eval_loss:

epoch1: 'eval_loss': 0.8832499384880066
epoch2: 'eval_loss': 0.6109879612922668
epoch3: 'eval_loss': 0.52149897813797
epoch4: 'eval_loss': 0.48024266958236694

Since the loss drops at every epoch, I would expect training to continue. Instead, training stopped after 4 epochs, and during the final evaluation it loaded the model from the first epoch, where eval_loss had its highest value, as you can see in the following:

01/26/2021 11:08:57 - INFO - __main__ -  ***** Eval results *****
01/26/2021 11:08:57 - INFO - __main__ -    eval_loss = 0.8832499384880066
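
To make my expectation concrete, here is a minimal sketch of the patience rule as I understand it. It is not the library's code: `first_stop_epoch` is a hypothetical helper, and I am assuming one evaluation per epoch and that "improvement" means the loss dropped by more than the threshold.

```python
from typing import Optional, Sequence


def first_stop_epoch(
    losses: Sequence[float], patience: int = 3, threshold: float = 0.0
) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    Hypothetical sketch of a patience rule for a lower-is-better metric;
    it only illustrates the behaviour I expected, not the real callback.
    """
    best = None
    evals_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if best is None or (best - loss) > threshold:
            # Loss improved by more than the threshold: reset the counter.
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch
    return None


# The eval_loss values reported above, one per epoch.
losses = [
    0.8832499384880066,
    0.6109879612922668,
    0.52149897813797,
    0.48024266958236694,
]
print(first_stop_epoch(losses))  # prints None: no stop expected within 4 epochs
```

Under these assumptions the rule never triggers on my values, which is why stopping after epoch 4 surprised me.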

Have I set some parameters incorrectly?
Thanks! :slight_smile:

EDIT: to clarify, I also printed the TrainerState values at the end of the training:

log_history=[
{'eval_loss': 0.837020993232727, 'eval_accuracy_score': 0.8039973127309372, 'eval_precision': 0.7904381747255738, 'eval_recall': 0.7808047316067748, 'eval_f1': 0.7855919213776935, 'eval_runtime': 8.375, 'eval_samples_per_second': 67.343, 'epoch': 1.0, 'step': 411}, {'loss': 1.5377, 'learning_rate': 4.6958980235865466e-05, 'epoch': 1.22, 'step': 500}, 
{'eval_loss': 0.6051444411277771, 'eval_accuracy_score': 0.8406953308700034, 'eval_precision': 0.8297104717236403, 'eval_recall': 0.8243570212384622, 'eval_f1': 0.8270250831610176, 'eval_runtime': 8.3919, 'eval_samples_per_second': 67.208, 'epoch': 2.0, 'step': 822}, {'loss': 0.6285, 'learning_rate': 4.3917595505563304e-05, 'epoch': 2.43, 'step': 1000}, 
{'eval_loss': 0.5184187889099121, 'eval_accuracy_score': 0.856567013772254, 'eval_precision': 0.8464932024849194, 'eval_recall': 0.8425486154673358, 'eval_f1': 0.8445163028833199, 'eval_runtime': 8.4159, 'eval_samples_per_second': 67.016, 'epoch': 3.0, 'step': 1233}, {'loss': 0.4561, 'learning_rate': 4.087621077526113e-05, 'epoch': 3.65, 'step': 1500}, 
{'eval_loss': 0.46523478627204895, 'eval_accuracy_score': 0.868743701713134, 'eval_precision': 0.8599369085173502, 'eval_recall': 0.8550049287570571, 'eval_f1': 0.8574638267277793, 'eval_runtime': 8.3682, 'eval_samples_per_second': 67.398, 'epoch': 4.0, 'step': 1644}, {'train_runtime': 1783.4323, 'train_samples_per_second': 4.609, 'epoch': 4.0, 'step': 1644}
], 
best_metric=0.837020993232727

As you can see here, best_metric is the eval_loss of the first epoch and not the lowest value among the epochs that were run (which are still few, because the value is always decreasing and therefore training should not even have stopped …).
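
As a quick cross-check, the lowest eval_loss can be picked out of the logged entries directly. This is only a sketch: the list below is a trimmed copy of the log_history printed above (keeping just the eval_loss, loss, epoch, and step keys), and `best_entry` is a name I made up for illustration. In a real run the list would come from trainer.state.log_history.

```python
# Trimmed copy of the log_history shown above (other metric keys omitted).
log_history = [
    {"eval_loss": 0.837020993232727, "epoch": 1.0, "step": 411},
    {"loss": 1.5377, "epoch": 1.22, "step": 500},
    {"eval_loss": 0.6051444411277771, "epoch": 2.0, "step": 822},
    {"loss": 0.6285, "epoch": 2.43, "step": 1000},
    {"eval_loss": 0.5184187889099121, "epoch": 3.0, "step": 1233},
    {"loss": 0.4561, "epoch": 3.65, "step": 1500},
    {"eval_loss": 0.46523478627204895, "epoch": 4.0, "step": 1644},
]

# Keep only evaluation entries; training-loss entries have no "eval_loss" key.
eval_entries = [entry for entry in log_history if "eval_loss" in entry]

# eval_loss is a lower-is-better metric, so the best entry is the minimum.
best_entry = min(eval_entries, key=lambda entry: entry["eval_loss"])
print(best_entry["epoch"], best_entry["eval_loss"])
```

This picks the epoch 4 entry, not the epoch 1 value that best_metric reports.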
