Hello
I'm attempting to fine-tune the distilbert-base-uncased model for token classification with custom entities. The dataset has the annotated tags in IOB format.
I imported and created a Hugging Face DatasetDict following the documentation and obtained this:
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 100
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 431
    })
    dev: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 350
    })
})
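For reference, this is roughly how I built it. The sketch below uses one placeholder record; the real splits are parsed from my IOB files and ner_tags_names holds all 29 tags:

from datasets import ClassLabel, Dataset, DatasetDict, Features, Sequence, Value

# Sketch only: one tiny placeholder record; the real records come from the IOB files
ner_tags_names = ['O', 'B-product', 'I-product']   # truncated here; the real list has 29 tags

features = Features({
    'id': Value('int32'),
    'tokens': Sequence(Value('string')),
    'ner_tags': Sequence(ClassLabel(names=ner_tags_names)),
})

train_split = Dataset.from_dict(
    {'id': [0],
     'tokens': [['BERT', 'is', 'a', 'language', 'model']],
     'ner_tags': [[1, 0, 0, 0, 0]]},     # indices into ner_tags_names
    features=features,
)

# 'test' and 'dev' are built the same way from their own record lists
dataset = DatasetDict({'train': train_split})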
Each dataset's Features are defined as below:
Features({
    'id': Value(dtype='int32', id=None),
    'tokens': Sequence(feature=Value(dtype='string', id=None), id=None),
    'ner_tags': Sequence(feature=ClassLabel(num_classes=ner_tags_num_classes,
                                            names=ner_tags_names, id=None), id=None)
})
And the NER tag mapping is the following:
{0: 'O', 1: 'B-product', 2: 'I-product', 3: 'B-field', 4: 'I-field',
5: 'B-task', 6: 'I-task', 7: 'B-researcher', 8: 'I-researcher', 9: 'B-university',
10: 'B-programlang', 11: 'B-algorithm', 12: 'I-algorithm', 13: 'B-misc', 14: 'I-misc',
15: 'I-university', 16: 'B-metrics', 17: 'B-organisation', 18: 'I-organisation', 19: 'I-metrics',
20: 'B-conference', 21: 'I-conference', 22: 'B-country', 23: 'I-programlang', 24: 'B-location',
25: 'B-person', 26: 'I-person', 27: 'I-country', 28: 'I-location'}
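Following the tutorial, I pass that mapping into the model when I instantiate it, roughly like this (the dictionaries are truncated in the sketch):

from transformers import TFAutoModelForTokenClassification

# id2label is the mapping shown above; label2id is its inverse (both truncated here)
id2label = {0: 'O', 1: 'B-product', 2: 'I-product'}
label2id = {label: idx for idx, label in id2label.items()}

model = TFAutoModelForTokenClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(id2label),   # 29 in the real setup
    id2label=id2label,
    label2id=label2id,
)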
I followed this tutorial for every step.
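In particular, the tokenization and label-alignment step is the tutorial's tokenize_and_align_labels, roughly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words and realign the labels to the generated sub-tokens
    tokenized = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, labels in enumerate(examples['ner_tags']):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)              # special tokens: ignored by loss and metrics
            elif word_id != previous_word_id:
                label_ids.append(labels[word_id])   # label only the first sub-token of each word
            else:
                label_ids.append(-100)
            previous_word_id = word_id
        all_labels.append(label_ids)
    tokenized['labels'] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)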
However, when it comes to computing the metric with seqeval on the test set, this is the output that I get:
/opt/conda/lib/python3.7/site-packages/seqeval/metrics/v1.py:57: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
{'algorithm': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 191},
'conference': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 391},
'country': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 57},
'field': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 93},
'location': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 412},
'metrics': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 521},
'misc': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 181},
'organisation': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 67},
'person': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 219},
'product': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 177},
'programlang': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 201},
'researcher': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 207},
'task': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 44},
'university': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 183},
'overall_precision': 0.0,
'overall_recall': 0.0,
'overall_f1': 0.0,
'overall_accuracy': 0.6539285236246821}
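The metric computation itself is the tutorial's compute_metrics, roughly as below. label_list holds the 29 tag names in index order, and I load seqeval through evaluate (the older load_metric('seqeval') behaves the same):

import numpy as np
import evaluate

seqeval = evaluate.load('seqeval')
label_list = [id2label[i] for i in range(len(id2label))]   # the tag names in index order

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    # Drop the -100 positions and map the remaining ids back to tag strings
    true_predictions = [
        [label_list[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return seqeval.compute(predictions=true_predictions, references=true_labels)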
I have absolutely no idea how to solve this problem.
- Is the model performing so badly that I get ill-defined precision and F-score?
- Did I make some error when I created the dataset?
- Do I have to look at the fine-tuning part of the code or only the evaluation part?
- Is there another way to evaluate the model using a test set that is a TensorFlow dataset? (My current evaluation loop is sketched below.)
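For completeness, the evaluation loop I currently run over the TensorFlow test set looks roughly like this; it assumes the tf.data.Dataset yields (features, labels) batches, and the variable names are illustrative:

import numpy as np
import evaluate

seqeval = evaluate.load('seqeval')
true_predictions, true_labels = [], []

# tf_test_set: the tokenized test split converted to a tf.data.Dataset,
# assumed here to yield (features_dict, labels) batches
for batch, labels in tf_test_set:
    logits = model(batch).logits.numpy()
    labels = labels.numpy()
    predictions = np.argmax(logits, axis=-1)
    for prediction, label in zip(predictions, labels):
        true_predictions.append([label_list[p] for p, l in zip(prediction, label) if l != -100])
        true_labels.append([label_list[l] for p, l in zip(prediction, label) if l != -100])

print(seqeval.compute(predictions=true_predictions, references=true_labels))

It repeats the -100 filtering from compute_metrics so that batches with different sequence lengths can be aggregated sentence by sentence.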