Token classification example script: metrics improve despite overfitting

Hello everyone,

I'd like to compare different models on various NER tasks. I found an example script for token classification here (notebooks/token_classification.ipynb at main · huggingface/notebooks · GitHub).

This script, however, appears to compute precision/recall/F1 using the training dataset rather than the validation dataset.
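
For context, here is a condensed version of the setup I'm running, adapted from that notebook. The hyperparameters are trimmed and the split names reflect my reading of the notebook, so please correct me if I'm paraphrasing it wrong:

```python
import numpy as np
from datasets import load_dataset, load_metric
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

datasets = load_dataset("conll2003")
label_list = datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, ner_tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, label_ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                label_ids.append(-100)  # special tokens / extra sub-tokens are ignored in the loss
            else:
                label_ids.append(ner_tags[word_id])
            previous = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_predictions = [[label_list[pr] for (pr, la) in zip(pred, lab) if la != -100]
                        for pred, lab in zip(predictions, labels)]
    true_labels = [[label_list[la] for (pr, la) in zip(pred, lab) if la != -100]
                   for pred, lab in zip(predictions, labels)]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {"precision": results["overall_precision"], "recall": results["overall_recall"],
            "f1": results["overall_f1"], "accuracy": results["overall_accuracy"]}

model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label_list))

args = TrainingArguments("bert-ner-conll2003", evaluation_strategy="epoch",
                         learning_rate=2e-5, num_train_epochs=5, weight_decay=0.01)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # is this split really used for the reported metrics?
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```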

Starting with epoch 3, the validation loss increases (overfitting?), yet the reported precision/recall/F1 scores continue to improve. For bert-base-cased trained on the conll2003 dataset, the curves look like this:
[Plot: BERT-Base-Overfit-CONLL-metrics]

Trainer.evaluate() also appears to select the training dataset rather than the evaluation dataset. How can I adjust the settings to calculate the metrics on the evaluation dataset?
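
Concretely, is explicitly passing the split, as in the sketch below, the intended way to do this, or is there a setting in TrainingArguments I'm missing? (I'm assuming here that evaluate() accepts an eval_dataset argument.)

```python
# Assumption on my part: Trainer.evaluate() accepts an eval_dataset argument,
# so the validation split can be passed explicitly.
metrics = trainer.evaluate(eval_dataset=tokenized_datasets["validation"])
print(metrics)  # eval loss plus the seqeval precision/recall/F1 from compute_metrics
```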

Best,
Oliver