xap
May 3, 2022, 6:52pm
1
I am trying to do multiclass classification for a sentence pair task. I uploaded my custom dataset (train and test splits separately) to the Hugging Face Hub, trained and tested my model, and now I want to see the F1 score and accuracy.
I tried
metric = load_metric("glue", "mrpc")
metric.add_batch(predictions=predictions, references=references)
but it says
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
How can I fix this and print precision, recall, and f1 score?
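For reference, this ValueError is raised by scikit-learn's metrics when average="binary" (the default) is used while the targets contain more than two classes. A quick sanity check, assuming references is a flat list of integer labels:

num_classes = len(set(references))
print(num_classes)  # anything above 2 means the binary default cannot be used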
xap
May 3, 2022, 7:42pm
2
@sgugger any help on this?
merve
May 4, 2022, 10:55am
3
Hello,
I didn't try it locally, but I think you can pass average through **kwargs, so something like:
metric.add_batch(predictions=predictions, references=references, average="micro")
should work. The binary average, as the error says, only works for binary classification problems.
xap
May 4, 2022, 3:59pm
4
@merve I tried it, but it doesn't work.
merve
May 5, 2022, 9:44am
5
Okay, I realized what was wrong.
MRPC itself is a binary classification task, so its metric expects binary targets. You're loading the MRPC metric, yet the error says your dataset is multiclass. Is that the case?
Apparently you can't change the average argument there, for a good reason.
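For context, the mrpc config of the GLUE metric computes accuracy and F1 with scikit-learn's defaults, roughly like this simplified sketch (not the exact source), which is why there is no average argument to override:

from sklearn.metrics import accuracy_score, f1_score

def mrpc_like_compute(predictions, references):
    # f1_score defaults to average="binary", hence the ValueError on multiclass targets
    return {
        "accuracy": accuracy_score(references, predictions),
        "f1": f1_score(references, predictions),
    }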
xap
May 5, 2022, 5:13pm
6
@merve Do you have any idea which metric I should use for multiclass classification if I want all of precision, recall, F1, and accuracy?
Were you able to solve the issue? Both approaches are not working for me.
xap
May 24, 2022, 6:18pm
9
Hey, you can use the following:
from datasets import load_metric

precision_metric = load_metric("precision")
precision = precision_metric.compute(predictions=y_pred, references=y_test, average="weighted")["precision"]
You can do the same for recall and F1 too. If you want another averaging scheme, like micro or macro, change the value of average.
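For example, a minimal sketch along the same lines, assuming y_pred and y_test are flat lists of integer class labels:

recall_metric = load_metric("recall")
f1_metric = load_metric("f1")

recall = recall_metric.compute(predictions=y_pred, references=y_test, average="weighted")["recall"]
f1 = f1_metric.compute(predictions=y_pred, references=y_test, average="weighted")["f1"]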
How do I generate y_pred here? I tried to do it but it's not working.
xap
May 24, 2022, 7:55pm
11
y_pred is the prediction of your model.
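If you are using the Trainer, one way to get it is the sketch below, assuming a tokenized test split called tokenized_datasets["test"] (the names are placeholders):

import numpy as np

# trainer.predict returns the logits plus the ground-truth labels of the split
output = trainer.predict(tokenized_datasets["test"])
y_pred = np.argmax(output.predictions, axis=-1)
y_test = output.label_ids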
BrunoSE
November 23, 2022, 4:14pm
12
Hello! I'm trying to use recall for a BERT fine-tuning notebook. I just want to understand why, after .compute(predictions, references, average), we index with ["precision"]. If it's recall, should I index with ["recall"] after the .compute() call?
EDIT: my script for multiclass BERT fine-tuning ran successfully with the following:
import numpy as np
from datasets import load_metric
from transformers import Trainer

metric = load_metric("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    compute_metrics=compute_metrics,
)
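As for the indexing: .compute() returns a dict keyed by the metric's name, so for recall you would index with ["recall"]. A minimal sketch:

result = metric.compute(predictions=[0, 1, 2, 2], references=[0, 2, 2, 2], average="weighted")
print(result)  # {'recall': ...}, so access it with result["recall"]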
Is there any new module for multi-class classification that achieves the above?
I found that a hack like this works as a temporary solution:
import evaluate
import numpy as np

def compute_metrics(eval_pred):
    metric1 = evaluate.load("precision")
    metric2 = evaluate.load("recall")
    metric3 = evaluate.load("f1")
    metric4 = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = metric1.compute(predictions=predictions, references=labels,
                                average="micro")["precision"]
    recall = metric2.compute(predictions=predictions, references=labels,
                             average="micro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=labels,
                         average="micro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=labels)["accuracy"]
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy}
dvdblk
June 13, 2024, 12:33pm
15
I wrote a StackOverflow answer on how to fix this error step by step.
The complete working code snippet is here:
import datasets
import evaluate
from evaluate import evaluator, Metric
from sklearn.metrics import accuracy_score


class MulticlassAccuracy(Metric):
    """Workaround for the default Accuracy class which doesn't support passing 'average' to the compute method."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Accuracy",
            citation="",
            inputs_description="",
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("int32")),
                    "references": datasets.Sequence(datasets.Value("int32")),
                }
                if self.config_name == "multilabel"
                else {
                    "predictions": datasets.Value("int32"),
                    "references": datasets.Value("int32"),
                }
            ),
            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
        )

    def _compute(self, predictions, references, normalize=True, sample_weight=None, **kwargs):
        # take **kwargs to avoid breaking when the metric is used with a compute method that takes additional arguments
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }


task_evaluator = evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}

metrics_dict = {
    "accuracy": MulticlassAccuracy(),
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
}

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(metrics_dict),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1, "NEUTRAL": 2},
)

print(eval_results)
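Note that data has to be defined before task_evaluator.compute is called; any datasets.Dataset with text and label columns will do. A minimal sketch, where the dataset name and split are placeholders rather than part of the original answer:

data = datasets.load_dataset("imdb", split="test[:200]")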