Token classification example script metrics improve despite overfit

Hello everyone,

I鈥檇 like to compare different models on various NER tasks. I found an example script for token classification here (notebooks/token_classification.ipynb at master 路 huggingface/notebooks 路 GitHub)

This script, however, appears to compute precision/recall/F1 using the training dataset rather than the validation dataset.

From epoch 3 onward, the validation loss increases (overfitting?), yet the reported scores continue to improve. This is what I see for the bert-base-cased model trained on the conll2003 dataset:

[training log / plot omitted]

Trainer.evaluate() also appears to select the training dataset rather than the evaluation dataset. What do I need to change so that the metrics are computed on the evaluation dataset?
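For reference, this is roughly how I set up the Trainer, following the notebook. A minimal sketch, assuming the notebook's variable names (`model`, `tokenizer`, `data_collator`, `compute_metrics`, and `tokenized_datasets` are defined earlier); the split names are from conll2003:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",  # run evaluation at the end of every epoch
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    # My understanding is that the metrics should come from this split,
    # not from train_dataset:
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

# Computes metrics on eval_dataset by default; a dataset can also be
# passed explicitly:
trainer.evaluate(eval_dataset=tokenized_datasets["validation"])
```

Is there something in this setup (or in the notebook's version of it) that would cause the metrics to be computed on the training split instead?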