Hi, I have trained a roberta-large model and specified "load_best_model_at_end=True" and "metric_for_best_model=f1". During training, I can see overfitting starting after the 6th epoch, which seems to be the sweet spot. In epoch 8, which is the next one to be evaluated due to gradient accumulation (just a side note, not that important for this question), the train loss decreases while the eval_loss increases, so overfitting has set in. At the end, the Transformers Trainer loads the model from epoch 8, checkpoint "-14928", because its f1 score is slightly higher. I was wondering: in theory, wouldn't the model from epoch 6 be better suited, since it did not overfit? Or does one really go by the f1 metric here even though the model did overfit? (The eval loss decreased constantly in epochs < 6.)
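
For context, my TrainingArguments look roughly like this (a minimal sketch; output_dir and most hyperparameters are simplified, only the checkpoint-selection settings matter here):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # best checkpoint is picked by highest eval_f1
    greater_is_better=True,
)
```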

The test_loss of the second checkpoint, which is the one loaded as the "best", is 0.128. Could I lower that by using the first checkpoint, which should be the better model anyway?

```
checkpoint-11196:
{'loss': 0.0638, 'learning_rate': 8.666799323450404e-06, 'epoch': 6.0}
{'eval_loss': 0.09599845856428146, 'eval_accuracy': 0.9749235986101227, 'eval_precision': 0.9648319293367138, 'eval_recall': 0.9858766505097777, 'eval_f1': 0.9752407721241682, 'eval_runtime': 282.2294, 'eval_samples_per_second': 84.637, 'eval_steps_per_second': 2.647, 'epoch': 6.0}
VS.
checkpoint-14928:
{'loss': 0.0312, 'learning_rate': 7.4291115311909265e-06, 'epoch': 8.0}
{'eval_loss': 0.12377820163965225, 'eval_accuracy': 0.976305103194206, 'eval_precision': 0.9719324391455539, 'eval_recall': 0.9810295838208257, 'eval_f1': 0.9764598236566295, 'eval_runtime': 276.7619, 'eval_samples_per_second': 86.309, 'eval_steps_per_second': 2.699, 'epoch': 8.0}
```
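
If selecting by validation loss is the recommended approach here, I assume the change would be something like the following (again just a sketch, not tested against my exact setup):

```python
training_args = TrainingArguments(
    output_dir="./results",              # placeholder path
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # select by validation loss instead of f1
    greater_is_better=False,             # lower loss = better checkpoint
)
```

Would that make the Trainer load the epoch-6 checkpoint in a run like this one, or is sticking with f1 the usual practice?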