Fine-tuning Token Classification with custom entities: "UndefinedMetricWarning: Precision and F-score are ill-defined"

Hello 🙂

I’m attempting to fine-tune the distilbert-base-uncased model for token classification with custom entities. The dataset's tags are annotated in IOB format.

I loaded the data into a Hugging Face DatasetDict following the documentation and obtained this:

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 100
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 431
    })
    dev: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 350
    })
})

The Features of each split are defined as follows:

Features({
'id': Value(dtype='int32', id=None),
'tokens': Sequence(feature=Value(dtype='string', id=None), id=None),
'ner_tags': Sequence(feature=ClassLabel(num_classes=ner_tags_num_classes, 
names=ner_tags_names, id=None), id=None)
})

The NER tag mapping is the following:

{0: 'O', 1: 'B-product', 2: 'I-product', 3: 'B-field', 4: 'I-field', 
5: 'B-task', 6: 'I-task', 7: 'B-researcher', 8: 'I-researcher', 9: 'B-university', 
10: 'B-programlang', 11: 'B-algorithm', 12: 'I-algorithm', 13: 'B-misc', 14: 'I-misc', 
15: 'I-university', 16: 'B-metrics', 17: 'B-organisation', 18: 'I-organisation', 19: 'I-metrics', 
20: 'B-conference', 21: 'I-conference', 22: 'B-country', 23: 'I-programlang', 24: 'B-location', 
25: 'B-person', 26: 'I-person', 27: 'I-country', 28: 'I-location'}
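
As a sanity check on this mapping, here is a minimal sketch in plain Python that verifies the ids are contiguous and that every entity type has both a B- and an I- tag. The variable names (`id2label`, `label2id`, `b_types`, `i_types`, `unpaired`) are my own choices for illustration, not something the tutorial prescribes.

```python
# Tag mapping copied verbatim from the post.
id2label = {
    0: 'O', 1: 'B-product', 2: 'I-product', 3: 'B-field', 4: 'I-field',
    5: 'B-task', 6: 'I-task', 7: 'B-researcher', 8: 'I-researcher', 9: 'B-university',
    10: 'B-programlang', 11: 'B-algorithm', 12: 'I-algorithm', 13: 'B-misc', 14: 'I-misc',
    15: 'I-university', 16: 'B-metrics', 17: 'B-organisation', 18: 'I-organisation', 19: 'I-metrics',
    20: 'B-conference', 21: 'I-conference', 22: 'B-country', 23: 'I-programlang', 24: 'B-location',
    25: 'B-person', 26: 'I-person', 27: 'I-country', 28: 'I-location',
}

# Reverse mapping, useful when configuring a model's label names.
label2id = {label: idx for idx, label in id2label.items()}

# The ids should run from 0 to N-1 with no gaps or duplicates.
ids_contiguous = sorted(id2label) == list(range(len(id2label)))

# Every entity type should appear with both a B- and an I- prefix.
b_types = {label[2:] for label in id2label.values() if label.startswith("B-")}
i_types = {label[2:] for label in id2label.values() if label.startswith("I-")}
unpaired = b_types ^ i_types  # symmetric difference: types missing one prefix

print("labels:", len(id2label))
print("entity types:", len(b_types))
print("ids contiguous:", ids_contiguous)
print("unpaired types:", sorted(unpaired))
```

If this check passes, the mapping itself is at least internally consistent, which would point the search toward the label alignment, the training, or the evaluation step instead.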

I followed this tutorial for every step.
However, when I compute the metrics with seqeval on the test set, this is the output I get:

/opt/conda/lib/python3.7/site-packages/seqeval/metrics/v1.py:57: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 
Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
{'algorithm': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 191},
 'conference': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 391},
 'country': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 57},
 'field': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 93},
 'location': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 412},
 'metrics': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 521},
 'misc': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 181},
 'organisation': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 67},
 'person': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 219},
 'product': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 177},
 'programlang': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 201},
 'researcher': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 207},
 'task': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 44},
 'university': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 183},
 'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 0.6539285236246821}
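
One pattern that would produce numbers like these (I have not confirmed it is what happens here) is a model that predicts O for every token: token accuracy stays fairly high because most tokens are O, but no entity is ever predicted, so precision becomes 0/0. The sketch below illustrates this in plain Python on made-up toy data. It is a simplified span-level scorer, not seqeval's actual implementation, and `extract_entities` and `span_scores` are hypothetical helper names.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB-tagged sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if ent_type is not None and not continues:
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        # A B- tag always opens a span; a stray I- tag opens one too (lenient).
        if tag.startswith("B-") or (tag.startswith("I-") and ent_type is None):
            start, ent_type = i, tag[2:]
    return spans


def span_scores(y_true, y_pred):
    """Entity-level precision and recall plus token-level accuracy.

    Precision or recall is None when its denominator is zero (undefined).
    """
    true_spans, pred_spans = set(), set()
    for n, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(n,) + s for s in extract_entities(t)}
        pred_spans |= {(n,) + s for s in extract_entities(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else None
    recall = correct / len(true_spans) if true_spans else None
    total = sum(len(t) for t in y_true)
    matches = sum(a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return precision, recall, matches / total


# Toy example (invented for illustration): gold tags contain two entities,
# while the "model" predicts O everywhere.
y_true = [["B-person", "I-person", "O", "O", "B-task", "O"]]
y_pred = [["O"] * 6]

precision, recall, accuracy = span_scores(y_true, y_pred)
print("precision:", precision)  # undefined, since nothing was predicted
print("recall:", recall)
print("accuracy:", accuracy)
```

Counting how often each tag id appears in the model's predictions on the test set would show quickly whether this is the situation, or whether the predictions are non-trivial but misaligned with the gold labels.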

I have absolutely no idea how to solve this problem.

  • Is the model performing so badly that precision and F-score end up ill-defined?
  • Did I make an error when I created the dataset?
  • Should I be looking at the fine-tuning part of the code, or only at the evaluation part?
  • Is there another way to evaluate the model when the test set is a TensorFlow dataset?