[HELP] Model Evaluation for NER yields different results (sklearn vs metric.compute())

I am evaluating the performance of my model: Transformer → AutoModel → XLNetForTokenClassification.

I am using exactly the same evaluation as in the tutorial Sylvain Gugger created: Google Colab

My problem is the following. The seqeval metric from the tutorial:

metric = load_metric("seqeval")
results = metric.compute(predictions=[true_predictions], references=[true_labels])

and classification_report(true_labels, true_predictions) (from sklearn.metrics) yield different scores.

Specifically, the sklearn classification report yields higher precision, recall and F1-score.

For HuggingFace built-in metric:

{'S': {'precision': 0.7408599678086917,
  'recall': 0.794182893763865,
  'f1': 0.7665952890792291,
  'number': 4057},
 'overall_precision': 0.7408599678086917,
 'overall_recall': 0.794182893763865,
 'overall_f1': 0.7665952890792291,
 'overall_accuracy': 0.9210345258944208}

Sklearn classification_report:

              precision    recall  f1-score   support

     label_1       0.80      0.85      0.83      4051
     label_2       0.84      0.82      0.83      4056
     label_3       0.96      0.95      0.95     23869

    accuracy                           0.92     31976
   macro avg       0.87      0.87      0.87     31976
weighted avg       0.92      0.92      0.92     31976

Can anyone tell me where this difference comes from?

Note that I pass exactly the same lists of predictions and labels to metric.compute() and classification_report().

I also manually went through every example of the validation set, predicted with my loaded PyTorch model (so not directly from Trainer), and created the classification report. The metrics match the sklearn classification report above, which means that the predictions from trainer.predict() and from plain PyTorch inference are identical.
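
One possible source of such a gap: seqeval scores whole entity spans (an entity counts as correct only if its type and both boundaries match), whereas sklearn's classification_report scores every token label independently. The sketch below illustrates that difference on toy data. It assumes IOB2-style tags, and the helpers `extract_entities`, `prf`, `entity_scores` and `token_scores` are hypothetical names written for this example, not functions from seqeval or sklearn, so treat it as a simplified illustration rather than either library's implementation.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from IOB2 tags (end exclusive).

    Simplification: an "I-" tag with no open entity is ignored.
    """
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        # An open entity ends at "O", at a new "B-", or when the type changes.
        if start is not None and (prefix != "I" or label != etype):
            entities.add((etype, start, i))
            start, etype = None, None
        if prefix == "B":
            start, etype = i, label
    return entities


def prf(tp, n_pred, n_true):
    """Precision, recall and F1 from raw counts, guarding against zero division."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def entity_scores(y_true, y_pred):
    """Entity-level scores: a predicted span must match a true span exactly."""
    true_ents = extract_entities(y_true)
    pred_ents = extract_entities(y_pred)
    return prf(len(true_ents & pred_ents), len(pred_ents), len(true_ents))


def token_scores(y_true, y_pred, label):
    """Token-level scores for one label: each token is judged on its own."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return prf(tp, y_pred.count(label), y_true.count(label))


# Toy example: the first predicted entity is one token too short.
y_true = ["B-S", "I-S", "I-S", "O", "B-S", "O"]
y_pred = ["B-S", "I-S", "O", "O", "B-S", "O"]

print("entity-level (P, R, F1):", entity_scores(y_true, y_pred))
for label in sorted(set(y_true)):
    print(f"token-level {label} (P, R, F1):", token_scores(y_true, y_pred, label))
```

On this toy input a single boundary error halves entity-level precision and recall (0.5 each), while five of the six tokens are still labelled correctly, so the per-token scores stay higher.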


I also faced the same issue and did the same analysis you did. I would like to know why this happens.


Please upvote my question then, so that it receives more attention.
