I am evaluating the performance of my Transformer model (AutoModel → XLNetForTokenClassification).
I am using exactly the same evaluation procedure as in the token-classification tutorial Sylvain Gugger created: Google Colab
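Concretely, I build the two lists from trainer.predict() roughly like this (a sketch of my post-processing, following the tutorial's -100 filtering; trainer, tokenized_datasets and label_list are the objects from the notebook, and I flatten everything into single flat lists of label strings):

import numpy as np

# trainer, tokenized_datasets and label_list come from the tutorial notebook
raw_predictions, raw_labels, _ = trainer.predict(tokenized_datasets["validation"])
raw_predictions = np.argmax(raw_predictions, axis=2)

# Drop the special-token positions (label -100) and flatten to one label string per token
true_predictions = [
    label_list[p]
    for prediction, label in zip(raw_predictions, raw_labels)
    for (p, l) in zip(prediction, label)
    if l != -100
]
true_labels = [
    label_list[l]
    for prediction, label in zip(raw_predictions, raw_labels)
    for (p, l) in zip(prediction, label)
    if l != -100
]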
I have the following dilemma:
metric = load_metric("seqeval")
results = metric.compute(predictions=[true_predictions], references=[true_labels])
and
classification_report(true_labels, true_predictions) (from sklearn.metrics)
yield different scores.
In essence, the classification report yields better Recall, Precision and F1-Score.
For the HuggingFace built-in metric (seqeval):
{'S': {'precision': 0.7408599678086917,
'recall': 0.794182893763865,
'f1': 0.7665952890792291,
'number': 4057},
'overall_precision': 0.7408599678086917,
'overall_recall': 0.794182893763865,
'overall_f1': 0.7665952890792291,
'overall_accuracy': 0.9210345258944208}
Sklearn classification_report:
              precision    recall  f1-score   support

     label_1       0.80      0.85      0.83      4051
     label_2       0.84      0.82      0.83      4056
     label_3       0.96      0.95      0.95     23869

    accuracy                           0.92     31976
   macro avg       0.87      0.87      0.87     31976
weighted avg       0.92      0.92      0.92     31976
Can anyone tell me where this difference comes from?
Note that I pass exactly the same lists of predictions and labels to metric.compute() and classification_report().
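For completeness, the whole comparison is literally just this (a consolidated sketch with the imports; true_predictions and true_labels are the flat lists of label strings built above):

from datasets import load_metric
from sklearn.metrics import classification_report

metric = load_metric("seqeval")

# The exact same two lists go into both metrics
print(metric.compute(predictions=[true_predictions], references=[true_labels]))
print(classification_report(true_labels, true_predictions))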
I also manually went through every example of the validation set, predicted with my loaded PyTorch model (so not via the Trainer), and built the classification report from those predictions. The metrics match the sklearn classification report above, which means the trainer.predict() predictions and the plain PyTorch predictions do not differ at all.
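For that manual check, my loop looked roughly like this (a sketch rather than my exact code; "my-xlnet-checkpoint" and eval_examples, a list of dicts with "tokens" and "ner_tags" keys, are placeholders for my fine-tuned checkpoint and validation data):

import torch
from sklearn.metrics import classification_report
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder names: swap in the real checkpoint path and validation examples
tokenizer = AutoTokenizer.from_pretrained("my-xlnet-checkpoint")
model = AutoModelForTokenClassification.from_pretrained("my-xlnet-checkpoint")
model.eval()

id2label = model.config.id2label  # int id -> label string

all_predictions, all_labels = [], []
for example in eval_examples:
    encoding = tokenizer(example["tokens"], is_split_into_words=True,
                         truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    predicted_ids = logits.argmax(dim=-1)[0].tolist()

    # Keep one prediction per original word (its first sub-token),
    # mirroring the tutorial's alignment of labels to words
    previous_word_id = None
    for token_index, word_id in enumerate(encoding.word_ids()):
        if word_id is None or word_id == previous_word_id:
            continue
        all_predictions.append(id2label[predicted_ids[token_index]])
        all_labels.append(id2label[example["ner_tags"][word_id]])
        previous_word_id = word_id

print(classification_report(all_labels, all_predictions))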