I am evaluating the performance of my Transformer model (AutoModel → XLNetForTokenClassification).
I am using exactly the same evaluation procedure as in the token-classification tutorial Sylvain Gugger created: Google Colab
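Concretely, I build the two lists from trainer.predict() roughly like this (a sketch of my post-processing, following the tutorial's -100 filtering; trainer, tokenized_datasets and label_list are the objects from the notebook, and I flatten everything into single flat lists of label strings):

import numpy as np

# trainer, tokenized_datasets and label_list come from the tutorial notebook
raw_predictions, raw_labels, _ = trainer.predict(tokenized_datasets["validation"])
raw_predictions = np.argmax(raw_predictions, axis=2)

# Drop the special-token positions (label -100) and flatten to one label string per token
true_predictions = [
    label_list[p]
    for prediction, label in zip(raw_predictions, raw_labels)
    for (p, l) in zip(prediction, label)
    if l != -100
]
true_labels = [
    label_list[l]
    for prediction, label in zip(raw_predictions, raw_labels)
    for (p, l) in zip(prediction, label)
    if l != -100
]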
I have the following dilemma:
metric = load_metric("seqeval")
results = metric.compute(predictions=[true_predictions], references=[true_labels])
and
classification_report(true_labels, true_predictions) (from sklearn.metrics)
yield different scores.
In essence, the classification report yields better Recall, Precision and F1-Score.
For the HuggingFace built-in metric (seqeval):
{'S': {'precision': 0.7408599678086917,
'recall': 0.794182893763865,
'f1': 0.7665952890792291,
'number': 4057},
'overall_precision': 0.7408599678086917,
'overall_recall': 0.794182893763865,
'overall_f1': 0.7665952890792291,
'overall_accuracy': 0.9210345258944208}
Sklearn classification_report:
              precision    recall  f1-score   support

     label_1       0.80      0.85      0.83      4051
     label_2       0.84      0.82      0.83      4056
     label_3       0.96      0.95      0.95     23869

    accuracy                           0.92     31976
   macro avg       0.87      0.87      0.87     31976
weighted avg       0.92      0.92      0.92     31976
Can anyone tell me where this difference comes from?
Note that I pass exactly the same lists of predictions and labels to metric.compute() and classification_report().
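For completeness, the whole comparison is literally just this (a consolidated sketch with the imports; true_predictions and true_labels are the flat lists of label strings built above):

from datasets import load_metric
from sklearn.metrics import classification_report

metric = load_metric("seqeval")

# The exact same two lists go into both metrics
print(metric.compute(predictions=[true_predictions], references=[true_labels]))
print(classification_report(true_labels, true_predictions))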
I also manually went through every example of the validation set, predicted with my loaded PyTorch model (so not via the Trainer), and built the classification report from those predictions. The metrics match the sklearn classification report above, which means the trainer.predict() predictions and the plain PyTorch predictions do not differ at all.
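For that manual check, my loop looked roughly like this (a sketch rather than my exact code; "my-xlnet-checkpoint" and eval_examples, a list of dicts with "tokens" and "ner_tags" keys, are placeholders for my fine-tuned checkpoint and validation data):

import torch
from sklearn.metrics import classification_report
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder names: swap in the real checkpoint path and validation examples
tokenizer = AutoTokenizer.from_pretrained("my-xlnet-checkpoint")
model = AutoModelForTokenClassification.from_pretrained("my-xlnet-checkpoint")
model.eval()

id2label = model.config.id2label  # int id -> label string

all_predictions, all_labels = [], []
for example in eval_examples:
    encoding = tokenizer(example["tokens"], is_split_into_words=True,
                         truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    predicted_ids = logits.argmax(dim=-1)[0].tolist()

    # Keep one prediction per original word (its first sub-token),
    # mirroring the tutorial's alignment of labels to words
    previous_word_id = None
    for token_index, word_id in enumerate(encoding.word_ids()):
        if word_id is None or word_id == previous_word_id:
            continue
        all_predictions.append(id2label[predicted_ids[token_index]])
        all_labels.append(id2label[example["ner_tags"][word_id]])
        previous_word_id = word_id

print(classification_report(all_labels, all_predictions))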