I tried to build a token classification (NER) model by fine-tuning DistilBERT and training with TensorFlow, and I used accuracy as the metric in the compile method.
I found that the loss was decreasing but the accuracy didn't increase, so I realised that accuracy is not an appropriate metric for NER tasks. I searched and found that the F1-score is a good choice, but when I tried it I got an error:
ValueError Traceback (most recent call last)
in <cell line: 1>()
----> 1 history = model.fit(train_dataset.take(1), validation_data=val_dataset, epochs=training_args["epochs"], callbacks=metric_callback)
5 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in add_batch(self, predictions, references, **kwargs)
544 f"Input references: {summarize_if_long_list(references)}"
545 )
---> 546         raise ValueError(error_msg) from None
547
548 def add(self, *, prediction=None, reference=None, **kwargs):
ValueError: Predictions and/or references don't match the expected format.
Expected format: {'predictions': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence')},
So I need a suitable metric for the NER task.
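For context on what the error is saying: the metric expects lists of label *strings* per sequence, not the integer label ids (with -100 padding) that the model and dataset produce. A minimal sketch of the usual conversion step, assuming a hypothetical id2label map (use the one from your own model config), which you would run before handing predictions and references to a seqeval-style F1 metric:

```python
import numpy as np

# Hypothetical label map for illustration; use your model's id2label.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

def postprocess(predictions, labels):
    """Convert logits and padded label ids into the string-sequence
    format the metric expects, dropping the -100 positions that mask
    special tokens and sub-word pieces."""
    pred_ids = np.argmax(predictions, axis=-1)
    true_labels = [
        [id2label[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [id2label[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(pred_ids, labels)
    ]
    return true_preds, true_labels

# Toy batch: 1 sequence, 4 tokens, 3 classes; first and last
# positions are masked with -100 (special tokens).
logits = np.array([[[2.0, 0.1, 0.1],
                    [0.1, 3.0, 0.2],
                    [0.1, 0.2, 3.0],
                    [1.0, 0.0, 0.0]]])
labels = np.array([[-100, 1, 2, -100]])
preds, refs = postprocess(logits, labels)
# preds == [["B-PER", "I-PER"]], refs == [["B-PER", "I-PER"]]
```

These string sequences can then be passed as predictions and references to evaluate.load("seqeval").compute, which reports entity-level precision, recall, and F1; the same conversion can live inside the metric_fn given to a KerasMetricCallback.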