Problem with EarlyStoppingCallback

I set the early stopping callback in my trainer as follows:

trainer = MyTrainer(
        ...,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.0)]
)

the values for this callback in the TrainingArguments are as follows:


What I expect is that training continues as long as the eval_loss metric keeps dropping, and that it stops only when the eval_loss has not dropped for more than 3 epochs, at which point the best model is loaded.
During the training I get these values for the eval_loss:

epoch1: 'eval_loss': 0.8832499384880066
epoch2: 'eval_loss': 0.6109879612922668
epoch3: 'eval_loss': 0.52149897813797
epoch4: 'eval_loss': 0.48024266958236694

Therefore, since it always drops, I would expect the training to continue. Instead, the training stopped after 4 epochs, and during the evaluation it loaded the model from the first epoch, where the eval_loss had the greatest value, as you can see in the following:

01/26/2021 11:08:57 - INFO - __main__ -  ***** Eval results *****
01/26/2021 11:08:57 - INFO - __main__ -    eval_loss = 0.8832499384880066

Have I set some parameters incorrectly?
Thanks! :slight_smile:

EDIT: to clarify, I also printed the TrainerState values at the end of the training:

{'eval_loss': 0.837020993232727, 'eval_accuracy_score': 0.8039973127309372, 'eval_precision': 0.7904381747255738, 'eval_recall': 0.7808047316067748, 'eval_f1': 0.7855919213776935, 'eval_runtime': 8.375, 'eval_samples_per_second': 67.343, 'epoch': 1.0, 'step': 411}, {'loss': 1.5377, 'learning_rate': 4.6958980235865466e-05, 'epoch': 1.22, 'step': 500}, 
{'eval_loss': 0.6051444411277771, 'eval_accuracy_score': 0.8406953308700034, 'eval_precision': 0.8297104717236403, 'eval_recall': 0.8243570212384622, 'eval_f1': 0.8270250831610176, 'eval_runtime': 8.3919, 'eval_samples_per_second': 67.208, 'epoch': 2.0, 'step': 822}, {'loss': 0.6285, 'learning_rate': 4.3917595505563304e-05, 'epoch': 2.43, 'step': 1000}, 
{'eval_loss': 0.5184187889099121, 'eval_accuracy_score': 0.856567013772254, 'eval_precision': 0.8464932024849194, 'eval_recall': 0.8425486154673358, 'eval_f1': 0.8445163028833199, 'eval_runtime': 8.4159, 'eval_samples_per_second': 67.016, 'epoch': 3.0, 'step': 1233}, {'loss': 0.4561, 'learning_rate': 4.087621077526113e-05, 'epoch': 3.65, 'step': 1500}, 
{'eval_loss': 0.46523478627204895, 'eval_accuracy_score': 0.868743701713134, 'eval_precision': 0.8599369085173502, 'eval_recall': 0.8550049287570571, 'eval_f1': 0.8574638267277793, 'eval_runtime': 8.3682, 'eval_samples_per_second': 67.398, 'epoch': 4.0, 'step': 1644}, {'train_runtime': 1783.4323, 'train_samples_per_second': 4.609, 'epoch': 4.0, 'step': 1644}

As you can also see from here, the best_metric is the eval_loss of the first epoch and not the lowest among the epochs completed (which are still few, because the value is always decreasing and therefore the training should not even stop …).


I’m trying to reproduce your issue, but on my side, the best_metric is correct and decreasing. Could you check you are using the latest version of Transformers and post the way you are creating your TrainingArguments?

I’m using version 4.2.0 of Transformers.

For the TrainingArguments, I’m using run_ner as a starting script where this function is used to take arguments:

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

I pass the TrainingArguments values that I modify as input through the .sh script, in this way:

export MAX_LENGTH=200
export BERT_MODEL=bert-base-uncased
export OUTPUT_DIR=transformers
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=500
export SEED=1

python3 \
--task_type POS \
--data_dir . \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--load_best_model_at_end \
--evaluation_strategy epoch \
--metric_for_best_model eval_loss \
--greater_is_better False \
--disable_tqdm False \
--save_total_limit 2 \
--do_train \
--do_eval

Ah yes, this comes from a bug in the argument parser that will be fixed by this PR. Basically greater_is_better is stored as a string and not a bool, so the tests using it don’t give the right results.

Since you are using eval_loss, it will default to False if you don’t say anything, so a workaround while you wait for the fix is to remove --greater_is_better False \ in your command.
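To illustrate why a string value breaks the comparison, here is a minimal sketch. It assumes the comparison operator is chosen with a plain truthiness test on greater_is_better, which is how I read the description above; the helper pick_operator is hypothetical and not part of Transformers.

```python
import operator


def pick_operator(greater_is_better):
    # Hypothetical helper: choose the comparison with a truthiness test.
    return operator.gt if greater_is_better else operator.lt


# A non-empty string is truthy, so the string "False" selects "greater is better".
op_from_string = pick_operator("False")
op_from_bool = pick_operator(False)

best, new = 0.8832, 0.6110  # eval_loss at epochs 1 and 2 from the first post, rounded

improved_with_string = op_from_string(new, best)  # lower loss is judged as worse
improved_with_bool = op_from_bool(new, best)      # lower loss is judged as better
print(improved_with_string, improved_with_bool)
```

With the string value, no later epoch ever beats the first one, which would match the first epoch being kept as the best model.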


Ah, ok thanks a lot! I’ll keep an eye on the PR.

So, vice versa, if the metric was eval_accuracy should I use --greater_is_better with the string true or with the boolean True? Or would it not work either way until the PR is approved?

It won’t work until the PR is merged either. But it will also default to the right value, so you won’t need to set it :wink:
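As a rough sketch of the defaulting described above: an explicit value wins, otherwise the direction is inferred from the metric name. The function name and the exact name check (metrics ending in "loss" are lower-is-better) are assumptions for illustration, not the library's actual code.

```python
def default_greater_is_better(metric_for_best_model, greater_is_better=None):
    # Hypothetical helper: keep an explicit setting, otherwise infer from the name.
    if greater_is_better is not None:
        return greater_is_better
    # Assumption: loss-like metrics should decrease, everything else should increase.
    return not metric_for_best_model.endswith("loss")


loss_direction = default_greater_is_better("eval_loss")
accuracy_direction = default_greater_is_better("eval_accuracy")
print(loss_direction, accuracy_direction)
```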


Perfect, thank you so much for the help and great work you are doing for the community! :grin:


Hello, I have a similar problem. I used metric_for_best_model = eval_f1 for the model. But EarlyStopping stops the model even when f1 score is increasing. I did not include greater_is_better in my training arguments. Should I include it or not?

training_args = TrainingArguments(
    ...
    evaluation_strategy="steps",
    ...
    save_strategy="steps",
    save_steps=1000,
    ...
    logging_strategy="epoch",
    ...
)

@Motahar, actually, your F1 was not increasing: since logging step 3000 it had not increased for 3 epochs. Hence, since you (probably) set early_stopping_patience=3, the training was interrupted.

But at step 6000 the F1 score improved. It went down at steps 4000 and 5000 but increased at step 6000. It went down consecutively for 2 epochs (here I am assuming 2 epochs means 2 logging steps), but not 3.
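The disagreement here seems to be whether patience counts drops relative to the previous evaluation or evaluations that fail to beat the best value so far. Below is a simplified sketch of the second reading, which is my understanding of how early stopping with patience usually works. The function stop_step is hypothetical, and the F1 values are invented to mirror the pattern described in this thread; they are not the actual numbers from the run.

```python
def stop_step(values, patience, greater_is_better, threshold=0.0):
    """Return the 1-based index of the evaluation that triggers a stop, or None."""
    best = None
    bad_evals = 0
    for i, value in enumerate(values, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best and abs(value - best) > threshold
        else:
            improved = value < best and abs(value - best) > threshold
        if improved:
            best = value
            bad_evals = 0
        else:
            # Counted against the best value so far, not the previous evaluation.
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Hypothetical F1 at steps 1000..6000: peak at 3000, dips, partial recovery at 6000.
f1_scores = [0.70, 0.74, 0.80, 0.78, 0.77, 0.79]
f1_stop = stop_step(f1_scores, patience=3, greater_is_better=True)

# eval_loss values from the first post (rounded): always decreasing, so no stop.
losses = [0.8832, 0.6110, 0.5215, 0.4802]
loss_stop = stop_step(losses, patience=3, greater_is_better=False)
print(f1_stop, loss_stop)
```

Under this reading, the recovery at the sixth evaluation still counts as a miss because it does not exceed the peak at the third one.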

I was looking at EarlyStoppingCallback and found a quirk that might be a feature or a bug: the patience kicks in only after the first save_state is met, as documented in [Maybe Bug] When using EarlyStopping Callbacks with Seq2SeqTraininer, training didn't stop
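A toy simulation of how such a quirk could arise, assuming the stored best metric is only updated when a checkpoint is saved and that the patience counter resets whenever no best metric is stored yet. Both assumptions are my reading of the behaviour described above, and evals_until_stop is a hypothetical function, not library code.

```python
def evals_until_stop(values, saved, patience):
    """values: metric per evaluation (higher is better).
    saved: whether a checkpoint was written right after each evaluation.
    Returns the 1-based evaluation index that triggers a stop, or None."""
    best = None  # stands in for the stored best metric
    bad_evals = 0
    for i, (value, did_save) in enumerate(zip(values, saved), start=1):
        if best is None or value > best:
            bad_evals = 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return i
        # Assumption: the best metric is only recorded at checkpoint time.
        if did_save and (best is None or value > best):
            best = value
    return None


worsening = [0.5, 0.4, 0.3, 0.2, 0.1]
never_saved = evals_until_stop(worsening, [False] * 5, patience=3)
always_saved = evals_until_stop(worsening, [True] * 5, patience=3)
print(never_saved, always_saved)
```

In this toy setup the metric gets worse at every evaluation, yet training only stops when checkpoints are actually being saved.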

Posting the comment here in case anyone else finds this post and has run into similar quirks when using early stopping.