I set the early stopping callback in my trainer as follows:
from transformers import EarlyStoppingCallback

trainer = MyTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.0)],
)
The values relevant to this callback in my TrainingArguments are as follows:
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False
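For context, here is a minimal sketch of how I build these arguments (output_dir and num_train_epochs are placeholders from my setup; evaluation runs once per epoch so the callback can check the metric):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",                 # placeholder path
    evaluation_strategy="epoch",         # evaluate once per epoch
    num_train_epochs=10,                 # placeholder upper bound
    load_best_model_at_end=True,         # reload the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,             # lower eval_loss is better
)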
What I expect is that training continues as long as the eval_loss metric keeps
dropping, and stops only once eval_loss has not improved for 3 consecutive
epochs, at which point the best model is loaded.
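To spell out how I understand the patience logic (a simplified sketch of my reading of it, not the actual library source; names are illustrative):

def should_stop(metric_value, best_metric, patience_counter,
                greater_is_better=False, threshold=0.0, patience=3):
    # With greater_is_better=False and threshold=0.0, any strictly
    # lower eval_loss counts as an improvement.
    if greater_is_better:
        improved = best_metric is None or metric_value > best_metric + threshold
    else:
        improved = best_metric is None or metric_value < best_metric - threshold
    if improved:
        return metric_value, 0, False            # new best, reset the counter
    patience_counter += 1
    return best_metric, patience_counter, patience_counter >= patience

With my numbers the eval_loss improves every epoch, so the counter should stay at 0 and training should never stop early.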
During training I get these values for eval_loss:
epoch1: 'eval_loss': 0.8832499384880066
epoch2: 'eval_loss': 0.6109879612922668
epoch3: 'eval_loss': 0.52149897813797
epoch4: 'eval_loss': 0.48024266958236694
Therefore, since eval_loss always drops, I would expect training to continue. Instead, training stopped after 4 epochs, and during evaluation it loaded the model from the first epoch, where eval_loss had the highest value, as you can see here:
01/26/2021 11:08:57 - INFO - __main__ - ***** Eval results *****
01/26/2021 11:08:57 - INFO - __main__ - eval_loss = 0.8832499384880066
Am I setting some parameters incorrectly?
Thanks!
EDIT: to clarify, I also printed the TrainerState values at the end of training:
log_history=[
    {'eval_loss': 0.837020993232727, 'eval_accuracy_score': 0.8039973127309372, 'eval_precision': 0.7904381747255738, 'eval_recall': 0.7808047316067748, 'eval_f1': 0.7855919213776935, 'eval_runtime': 8.375, 'eval_samples_per_second': 67.343, 'epoch': 1.0, 'step': 411},
    {'loss': 1.5377, 'learning_rate': 4.6958980235865466e-05, 'epoch': 1.22, 'step': 500},
    {'eval_loss': 0.6051444411277771, 'eval_accuracy_score': 0.8406953308700034, 'eval_precision': 0.8297104717236403, 'eval_recall': 0.8243570212384622, 'eval_f1': 0.8270250831610176, 'eval_runtime': 8.3919, 'eval_samples_per_second': 67.208, 'epoch': 2.0, 'step': 822},
    {'loss': 0.6285, 'learning_rate': 4.3917595505563304e-05, 'epoch': 2.43, 'step': 1000},
    {'eval_loss': 0.5184187889099121, 'eval_accuracy_score': 0.856567013772254, 'eval_precision': 0.8464932024849194, 'eval_recall': 0.8425486154673358, 'eval_f1': 0.8445163028833199, 'eval_runtime': 8.4159, 'eval_samples_per_second': 67.016, 'epoch': 3.0, 'step': 1233},
    {'loss': 0.4561, 'learning_rate': 4.087621077526113e-05, 'epoch': 3.65, 'step': 1500},
    {'eval_loss': 0.46523478627204895, 'eval_accuracy_score': 0.868743701713134, 'eval_precision': 0.8599369085173502, 'eval_recall': 0.8550049287570571, 'eval_f1': 0.8574638267277793, 'eval_runtime': 8.3682, 'eval_samples_per_second': 67.398, 'epoch': 4.0, 'step': 1644},
    {'train_runtime': 1783.4323, 'train_samples_per_second': 4.609, 'epoch': 4.0, 'step': 1644}
],
best_metric=0.837020993232727
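For reference, I printed these after training finished, roughly like this:

# after trainer.train() returns
print(trainer.state.log_history)
print(trainer.state.best_metric)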
As you can also see from this, best_metric is the eval_loss of the first epoch, not the lowest value among the epochs it ran (and it ran only a few, even though the value was always decreasing, so training should not have stopped at all…).
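To make it concrete, the minimum eval_loss over the logged epochs is 0.46523478627204895, not the 0.837020993232727 stored in best_metric:

# eval_loss values copied from the log_history above
eval_losses = [0.837020993232727, 0.6051444411277771,
               0.5184187889099121, 0.46523478627204895]
print(min(eval_losses))  # 0.46523478627204895 -- the value I expected as best_metric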