Calculate precision, recall, and F1 score for a custom multiclass classification dataset

I am trying to do multiclass classification for a sentence pair task. I uploaded my custom dataset (train and test splits separately) to the Hugging Face Hub, trained my model, tested it, and now want to see the F1 score and accuracy.

I tried

metric = load_metric("glue", "mrpc")

metric.add_batch(predictions=predictions, references=references)

but it says

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

How can I fix this and print precision, recall, and f1 score?

@sgugger any help on this?

Hello,
I didn’t try this locally, but I think you can pass average through **kwargs, so maybe something like this:

metric.add_batch(predictions=predictions, references=references, average="micro")

should work. As the error says, the binary average only works for binary classification problems.

@merve I tried it, but it doesn’t work.

Okay, I realized what was wrong.
MRPC itself is a binary classification task, so its metric expects binary targets. You’re loading the MRPC metric, yet the error says your dataset is multiclass. Is that the case?

Apparently you can’t change the average argument for a good reason.

@merve Do you have any idea which metric I should use for multiclass classification if I want to get precision, recall, F1, and accuracy?

Were you able to solve the issue? Neither of these is working for me.

Hey, you can use the following:

from datasets import load_metric

precision_metric = load_metric("precision")
precision = precision_metric.compute(predictions=y_pred, references=y_test, average="weighted")["precision"]

You can do the same for recall and F1 too. If you want another averaging method like micro or macro, change the value of average, as in the sketch below.
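A minimal sketch of that, assuming y_pred and y_test are your predicted and true label arrays from above:

from datasets import load_metric

# load each metric separately; they all accept the average argument for multiclass targets
precision_metric = load_metric("precision")
recall_metric = load_metric("recall")
f1_metric = load_metric("f1")

precision = precision_metric.compute(predictions=y_pred, references=y_test, average="weighted")["precision"]
recall = recall_metric.compute(predictions=y_pred, references=y_test, average="weighted")["recall"]
f1 = f1_metric.compute(predictions=y_pred, references=y_test, average="weighted")["f1"]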

2 Likes

How do I generate y_pred here? I tried to do it, but it’s not working.

y_pred is the prediction of your model
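For example, a minimal sketch with a Trainer (assuming trainer and tokenized_datasets come from your fine-tuning setup):

import numpy as np

# Trainer.predict returns logits plus the true labels; argmax over the class dimension gives label ids
output = trainer.predict(tokenized_datasets["test"])
y_pred = np.argmax(output.predictions, axis=-1)
y_test = output.label_ids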

Hello! I’m trying to use recall for a BERT fine-tuning notebook. I just want to understand why, after .compute(pred, references, average), we index with [“precision”]. If it’s recall, should I index with [“recall”] after the .compute() call?

EDIT: my script for multiclass BERT fine-tuning ran successfully with the following:

import numpy as np
from datasets import load_metric
from transformers import Trainer

metric = load_metric("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    compute_metrics=compute_metrics,
)
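To answer the question above: the key matches the metric you load, so load_metric("recall").compute(...) returns a dict with a "recall" key (only the precision metric returns a "precision" key), which is why the dict can be returned directly here. A tiny sketch:

recall_metric = load_metric("recall")
result = recall_metric.compute(predictions=predictions, references=labels, average="weighted")
# result is a dict like {"recall": 0.87}, so you would index with result["recall"]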

Has any new module been added for multi-class classification to achieve the above?

I found a hack like this that works as a temporary solution:

import evaluate
import numpy as np

def compute_metrics(eval_pred):
    metric1 = evaluate.load("precision")
    metric2 = evaluate.load("recall")
    metric3 = evaluate.load("f1")
    metric4 = evaluate.load("accuracy")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    precision = metric1.compute(predictions=predictions, references=labels,
                                average="micro")["precision"]
    recall = metric2.compute(predictions=predictions, references=labels,
                             average="micro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=labels,
                         average="micro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=labels)[
        "accuracy"]

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy}

I wrote a StackOverflow answer on how to fix this error step by step.

The complete working code snippet is here:

import datasets
import evaluate
from evaluate import evaluator, Metric
from sklearn.metrics import accuracy_score


class MulticlassAccuracy(Metric):
    """Workaround for the default Accuracy class which doesn't support passing 'average' to the compute method."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Accuracy",
            citation="",
            inputs_description="",
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("int32")),
                    "references": datasets.Sequence(datasets.Value("int32")),
                }
                if self.config_name == "multilabel"
                else {
                    "predictions": datasets.Value("int32"),
                    "references": datasets.Value("int32"),
                }
            ),
            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
        )

    def _compute(self, predictions, references, normalize=True, sample_weight=None, **kwargs):
        # take **kwargs to avoid breaking when the metric is used with a compute method that takes additional arguments
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }

task_evaluator = evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}
metrics_dict = {
    "accuracy": MulticlassAccuracy(),
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
}

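# data is your evaluation split (not defined in this snippet), e.g. a datasets.Dataset with text and label columns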
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(metrics_dict),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1, "NEUTRAL": 2}
)
print(eval_results)