Calculate precision, recall, f1 score for custom dataset for multiclass classification

xap · May 3, 2022, 6:52pm

I am trying to do multiclass classification for a sentence-pair task. I uploaded my custom train and test sets separately as a Hugging Face dataset, trained and tested my model, and now want to see the F1 score and accuracy.

I tried

metric = load_metric("glue", "mrpc")

metric.add_batch(predictions=predictions, references=references)

but it says

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

How can I fix this and print precision, recall, and f1 score?

xap · May 3, 2022, 7:42pm

@sgugger any help on this?

merve · May 4, 2022, 10:55am

Hello,
I haven't tried this locally, but I think you can pass average through **kwargs. Something like this:

metric.add_batch(predictions=predictions, references=references, average="micro")

should work. As the error says, the binary average only works for binary classification problems.

xap · May 4, 2022, 3:59pm

@merve I tried it, but it doesn't work.

merve · May 5, 2022, 9:44am

Okay I realized what was wrong.
MRPC itself is a binary classification task, so your dataset has to have binary targets. You're loading MRPC as the metric, yet the error says your dataset is multiclass. Is that right?

Apparently you can't change the average argument for this metric, and for good reason.
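
For context, here is a minimal sketch of where the error comes from. It assumes, as the wording of the traceback suggests, that the GLUE MRPC metric ends up calling scikit-learn's f1_score with its default average="binary". The labels are made up for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical three-class labels, invented for this example.
references = [0, 1, 2, 2, 1, 0]
predictions = [0, 2, 2, 2, 1, 0]

try:
    # The default is average="binary", which is only defined for two classes.
    f1_score(references, predictions)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)

# Choosing an explicit multiclass averaging scheme avoids the error.
weighted_f1 = f1_score(references, predictions, average="weighted")
print(weighted_f1)
```

So the fix is to use a metric that lets you choose the averaging scheme, rather than one hard-wired to a binary task.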

xap · May 5, 2022, 5:13pm

@merve Do you have any idea which metric I should use for multiclass classification if I want precision, recall, F1, and accuracy all together?

gautamshahi · May 24, 2022, 5:41pm

Were you able to solve the issue? Neither approach is working for me.

xap · May 24, 2022, 6:18pm

Hey, you can use the following:

from datasets import load_metric

precision_metric = load_metric("precision")
precision = precision_metric.compute(predictions=y_pred, references=y_test, average="weighted")["precision"]

You can do the same for recall and F1 too. If you want another averaging scheme, such as micro or macro, change the value of average.
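
To make the difference between the average settings concrete, here is a rough pure-Python sketch of how micro, macro, and weighted F1 are usually defined. It is an illustration only, not the library's implementation; the helper names and the labels are invented for this example. Precision and recall follow the same pattern.

```python
def class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} using one-vs-rest counts per class."""
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, average):
    counts = class_counts(y_true, y_pred)
    if average == "micro":
        # Pool the counts over all classes, then compute a single F1.
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        fn = sum(c[2] for c in counts.values())
        return f1_from_counts(tp, fp, fn)
    per_class = {label: f1_from_counts(*c) for label, c in counts.items()}
    if average == "macro":
        # Unweighted mean: every class counts equally.
        return sum(per_class.values()) / len(per_class)
    if average == "weighted":
        # Mean weighted by support (number of true examples per class).
        support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
        total = sum(support.values())
        return sum(per_class[l] * support[l] for l in per_class) / total
    raise ValueError(f"unsupported average: {average!r}")


# Hypothetical imbalanced three-class labels.
y_test = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
for mode in ("micro", "macro", "weighted"):
    print(mode, round(averaged_f1(y_test, y_pred, mode), 4))
```

On imbalanced data the three values can differ noticeably, so it is worth reporting which one you used.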

gautamshahi · May 24, 2022, 6:27pm

How do I generate y_pred here? I tried, but it's not working.

xap · May 24, 2022, 7:55pm

y_pred is the prediction of your model
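
As a small sketch of what that means in practice: a classification model typically outputs one logit per class for each example, and the predicted class is the index of the largest logit. The logits below are invented for illustration. If you are using Trainer, the logits would usually come from something like trainer.predict(test_dataset).predictions, where test_dataset is a placeholder for your own tokenized test split.

```python
import numpy as np

# Hypothetical logits for 4 examples and 3 classes, invented for illustration.
logits = np.array([
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.7],
    [-0.5, 1.9, 0.4],
    [1.2, 1.1, 0.9],
])

# The predicted class is the index of the largest logit in each row.
y_pred = np.argmax(logits, axis=-1)
print(y_pred.tolist())
```

y_test is then simply the list of true integer labels for the same examples, in the same order.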

BrunoSE · November 23, 2022, 4:14pm

Hello! I'm trying to use recall in a BERT fine-tuning notebook. I just want to understand why, after .compute(pred, references, average), we index the result with ["precision"]. If I want recall, should I use ["recall"] after the .compute() method instead?

EDIT: my script for multiclass BERT fine tuning was able to run successfully with the following:

import numpy as np
from datasets import load_metric
from transformers import Trainer

metric = load_metric("recall")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    compute_metrics=compute_metrics,
)

vpkprasanna · November 26, 2023, 2:14pm

Has any new module been added that handles all of the above for multiclass?

vpkprasanna · November 26, 2023, 2:20pm

I found a hack like this that works as a temporary solution:

import evaluate
import numpy as np

# Load each metric once, so it is not reloaded on every evaluation call.
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Predicted class = index of the largest logit for each example.
    predictions = np.argmax(logits, axis=-1)

    precision = precision_metric.compute(predictions=predictions, references=labels,
                                         average="micro")["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels,
                                   average="micro")["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels,
                           average="micro")["f1"]
    # Accuracy does not take an average argument.
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)[
        "accuracy"]

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy}
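
One thing to be aware of: for single-label multiclass data, average="micro" makes precision, recall, and F1 all equal to accuracy, so "macro" or "weighted" is usually more informative. As an alternative sketch, not something posted in this thread, the same dictionary can be built by calling scikit-learn directly, which returns all three scores in one call. The average default chosen here is an assumption; pick whichever suits your data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred, average="macro"):
    """Return precision, recall, F1, and accuracy for multiclass predictions."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # With average set, the fourth return value (support) is None.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average=average, zero_division=0
    )
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "accuracy": float(accuracy_score(labels, predictions)),
    }


# Hypothetical check with one-hot "logits" for 7 examples and 3 classes.
example_logits = np.eye(3)[[0, 0, 0, 1, 1, 2, 2]]
example_labels = np.array([0, 0, 0, 0, 1, 1, 2])
print(compute_metrics((example_logits, example_labels)))
```

Because Trainer calls compute_metrics with a single argument, the average parameter keeps its default there. The keys of the returned dictionary are what show up in the evaluation logs.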

dvdblk · June 13, 2024, 12:33pm

I wrote a StackOverflow answer on how to fix this error step by step.

The complete working code snippet is here:

import datasets
import evaluate
from evaluate import evaluator, Metric
from sklearn.metrics import accuracy_score


class MulticlassAccuracy(Metric):
    """Workaround for the default Accuracy class which doesn't support passing 'average' to the compute method."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Accuracy",
            citation="",
            inputs_description="",
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("int32")),
                    "references": datasets.Sequence(datasets.Value("int32")),
                }
                if self.config_name == "multilabel"
                else {
                    "predictions": datasets.Value("int32"),
                    "references": datasets.Value("int32"),
                }
            ),
            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
        )

    def _compute(self, predictions, references, normalize=True, sample_weight=None, **kwargs):
        # take **kwargs to avoid breaking when the metric is used with a compute method that takes additional arguments
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }

task_evaluator = evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}
metrics_dict = {
    "accuracy": MulticlassAccuracy(),
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
}

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,  # your evaluation dataset; not defined in this snippet
    metric=evaluate.combine(metrics_dict),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1, "NEUTRAL": 2}
)
print(eval_results)
