Early stopping callback problem

Hello,

I am having problems with the EarlyStoppingCallback I set up in my trainer class as below:

from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir = 'BERT',
    num_train_epochs = epochs,
    do_train = True,
    do_eval = True,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    warmup_steps = 250,
    weight_decay = 0.01,
    fp16 = True,
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end = True
)

trainer = MyTrainer(
    model = bert,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

trainer.train()

I keep getting the following error:

I already tried running the code without the metric_for_best_model arg, but it still gives me the same error.

I tweaked the Trainer class a bit to report metrics during training, and also created custom_metrics to report during validation. I suspect that maybe I made a mistake there and that’s why I can’t retrieve the validation loss now. See here for the tweaked code.

Thanks in advance!!

No, you won’t be able to use the EarlyStoppingCallback with a nested dictionary of metrics as you did. It will also need the metric you are looking for to be prefixed by eval_ (if the name you pass does not start with that prefix, the callback adds it, unless you change the code too). You will probably need to write your own version of the callback for this use case.
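As a rough illustration of the nested-dictionary point, one option is to flatten the metrics into a single level before they are returned, so that each value sits under its own top-level key. This is only a sketch: `flatten_metrics` and the sample metric names are hypothetical and are not part of Transformers or of the code in the question.

```python
def flatten_metrics(metrics, prefix="", sep="_"):
    """Flatten a nested metrics dict into a single level.

    Hypothetical helper (not a Transformers API). Nested keys are joined
    with `sep`; `prefix` is prepended to every key when it is non-empty.
    """
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse, carrying the joined name down as the new prefix.
            flat.update(flatten_metrics(value, prefix=name, sep=sep))
        else:
            flat[name] = value
    return flat


# Made-up values, only to show the shape of the result.
nested = {"loss": 0.42, "accuracy": {"macro": 0.81, "micro": 0.85}}

flat = flatten_metrics(nested)
prefixed = flatten_metrics(nested, prefix="eval")
print(flat)
print(prefixed)
```

With no prefix, the keys come out as loss, accuracy_macro and accuracy_micro; with prefix="eval" they become eval_loss, eval_accuracy_macro and eval_accuracy_micro. Whether the prefix has to be added by hand depends on how the customized Trainer reports its metrics.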

At some point, instead of rewriting the whole Trainer, you might be interested in writing your own training loop with Accelerate. You can still have mixed precision training and distributed training, but you will have full control over your training loop. There is one example for each task using Accelerate (the run_xxx_no_trainer scripts) in the Transformers examples.
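In a hand-written loop, early stopping comes down to a small amount of bookkeeping on the validation loss. The sketch below is framework-free and is not the EarlyStoppingCallback implementation: `EarlyStopper`, `min_delta` and the loss values are assumptions made for illustration, and the `patience` argument only mirrors the idea behind `early_stopping_patience`.

```python
class EarlyStopper:
    """Hypothetical patience tracker for a custom training loop."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # evaluations allowed without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = None
        self.bad_evals = 0

    def step(self, value):
        """Record a validation loss; return True when training should stop."""
        if self.best is None or value < self.best - self.min_delta:
            self.best = value
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up validation losses standing in for one evaluation per epoch.
val_losses = [0.90, 0.70, 0.72, 0.71, 0.75, 0.60]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    # In a real loop, the training and evaluation steps would run here.
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

Here the loss stops improving after the second evaluation, so the loop breaks once three evaluations in a row fail to beat 0.70. Keeping the best checkpoint around is left out and would have to be handled separately.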

Thanks so much @sgugger! Will try it out!