I’d like to compare different models on various NER tasks. I found an example script for token classification here (notebooks/token_classification.ipynb at master · huggingface/notebooks · GitHub)
This script, however, appears to compute precision/recall/F1 using the training dataset rather than the validation dataset.
From epoch 3 onward, the validation loss increases (overfitting?), yet the reported precision/recall/F1 scores keep improving. This is with bert-base-cased fine-tuned on the conll2003 dataset:
Trainer.evaluate() also appears to select the training dataset rather than the evaluation dataset. How can I configure the Trainer so that the metrics are computed on the evaluation dataset?
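For context, the relevant wiring looks roughly like this. This is a sketch based on the notebook, not verified code; the names `model`, `tokenizer`, `tokenized_datasets`, and `compute_metrics` are assumed to be defined as in the notebook:

```python
from transformers import Trainer, TrainingArguments

# Sketch of the notebook's Trainer setup (variable names assumed from the notebook)
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",  # run evaluation at the end of each epoch
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # metrics should come from this split
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()

# Trainer.evaluate() is supposed to fall back to the eval_dataset passed above
# when called without arguments; a split can also be passed explicitly:
metrics = trainer.evaluate(eval_dataset=tokenized_datasets["validation"])
```

My understanding is that passing `eval_dataset` explicitly, as in the last line, should force evaluation on the validation split, yet the numbers I see still look like training-set metrics.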