Trainer load_best_model_at_end: f1 score vs. loss and overfitting

Hi, I have trained a roberta-large with “load_best_model_at_end=True” and “metric_for_best_model=f1”. During training, I can see overfitting after epoch 6, which is the sweet spot. In epoch 8, which is the next one to be evaluated due to gradient accumulation (just side info, not that important for this question), the train loss decreases while eval_loss increases, so overfitting starts. At the end, the transformers Trainer loads the model from epoch 8, checkpoint “-14928”, because its f1 score is a bit higher. I was wondering: in theory, wouldn’t the model from epoch 6 be better suited, as it did not overfit? Or does one really go by the f1 metric here even though the model did overfit? (The eval loss decreased constantly in the epochs before 6.)

The test_loss from the second checkpoint, which is then loaded as the “best”, is 0.128. Is it possible to lower that by using the first checkpoint, which should be the better model anyway?

{'loss': 0.0638, 'learning_rate': 8.666799323450404e-06, 'epoch': 6.0}

{'eval_loss': 0.09599845856428146, 'eval_accuracy': 0.9749235986101227, 'eval_precision': 0.9648319293367138, 'eval_recall': 0.9858766505097777, 'eval_f1': 0.9752407721241682, 'eval_runtime': 282.2294, 'eval_samples_per_second': 84.637, 'eval_steps_per_second': 2.647, 'epoch': 6.0}


{'loss': 0.0312, 'learning_rate': 7.4291115311909265e-06, 'epoch': 8.0}

{'eval_loss': 0.12377820163965225, 'eval_accuracy': 0.976305103194206, 'eval_precision': 0.9719324391455539, 'eval_recall': 0.9810295838208257, 'eval_f1': 0.9764598236566295, 'eval_runtime': 276.7619, 'eval_samples_per_second': 86.309, 'eval_steps_per_second': 2.699, 'epoch': 8.0}
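
To make the question concrete, here is a minimal sketch of how the "best" checkpoint flips depending on which metric is compared. It is not the Trainer's actual code: `pick_best` is a hypothetical helper, and the numbers are the eval values from the two logs above, rounded to four decimals.

```python
# Eval values copied from the two checkpoint logs above (rounded).
eval_logs = [
    {"epoch": 6.0, "eval_loss": 0.0960, "eval_f1": 0.9752},
    {"epoch": 8.0, "eval_loss": 0.1238, "eval_f1": 0.9765},
]


def pick_best(logs, metric, greater_is_better):
    """Return the log entry that is best on `metric`.

    Hypothetical helper for illustration only; it just mimics the idea of
    comparing checkpoints on a single metric.
    """
    chooser = max if greater_is_better else min
    return chooser(logs, key=lambda entry: entry[metric])


# F1: higher is better, so the later checkpoint wins.
best_by_f1 = pick_best(eval_logs, "eval_f1", greater_is_better=True)

# Loss: lower is better, so the earlier checkpoint wins.
best_by_loss = pick_best(eval_logs, "eval_loss", greater_is_better=False)

print("best by f1:  ", best_by_f1["epoch"])
print("best by loss:", best_by_loss["epoch"])
```

If selection by loss is what is wanted, my understanding is that this corresponds to setting `metric_for_best_model="eval_loss"` together with `greater_is_better=False` in `TrainingArguments`, so the epoch 6 checkpoint would be the one loaded at the end. Whether that checkpoint also gives a lower test_loss would still have to be checked on the test set.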