Hi, I have trained a roberta-large model and specified "load_best_model_at_end=True" and "metric_for_best_model=f1". During training, I can see overfitting starting after the 6th epoch, which seems to be the sweet spot. In epoch 8, which is the next one to be evaluated due to gradient accumulation (just a side note, not that important for this question), the train loss decreases while the eval_loss increases, so overfitting has set in. At the end, the Transformers Trainer loads the model from epoch 8, checkpoint "-14928", because its f1 score is slightly higher. I was wondering: in theory, wouldn't the model from epoch 6 be better suited, since it did not overfit? Or does one really go by the f1 metric here even though the model did overfit? (The eval loss decreased constantly in epochs < 6.)
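
For context, my TrainingArguments look roughly like this (a minimal sketch; output_dir and most hyperparameters are simplified, only the checkpoint-selection settings matter here):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # best checkpoint is picked by highest eval_f1
    greater_is_better=True,
)
```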

The test_loss of the second checkpoint, which is the one loaded as the "best", is 0.128. Could I lower that by using the first checkpoint, which should be the better model anyway?

```
checkpoint-11196:
{'loss': 0.0638, 'learning_rate': 8.666799323450404e-06, 'epoch': 6.0}
{'eval_loss': 0.09599845856428146, 'eval_accuracy': 0.9749235986101227, 'eval_precision': 0.9648319293367138, 'eval_recall': 0.9858766505097777, 'eval_f1': 0.9752407721241682, 'eval_runtime': 282.2294, 'eval_samples_per_second': 84.637, 'eval_steps_per_second': 2.647, 'epoch': 6.0}
VS.
checkpoint-14928:
{'loss': 0.0312, 'learning_rate': 7.4291115311909265e-06, 'epoch': 8.0}
{'eval_loss': 0.12377820163965225, 'eval_accuracy': 0.976305103194206, 'eval_precision': 0.9719324391455539, 'eval_recall': 0.9810295838208257, 'eval_f1': 0.9764598236566295, 'eval_runtime': 276.7619, 'eval_samples_per_second': 86.309, 'eval_steps_per_second': 2.699, 'epoch': 8.0}
```
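
If selecting by validation loss is the recommended approach here, I assume the change would be something like the following (again just a sketch, not tested against my exact setup):

```python
training_args = TrainingArguments(
    output_dir="./results",              # placeholder path
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # select by validation loss instead of f1
    greater_is_better=False,             # lower loss = better checkpoint
)
```

Would that make the Trainer load the epoch-6 checkpoint in a run like this one, or is sticking with f1 the usual practice?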