Combining metrics for multiclass prediction evaluation

Good afternoon:

I’ve started playing with the evaluate library, following the quick tour, on a multiclass classification problem, and I have a few questions. For example, I’d like to use metrics = evaluate.combine(['precision', 'recall']), but when calling metrics.compute(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2], average='weighted') the average argument does not seem to be passed through:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
Not a big concern as I can compute the metrics individually, but I was wondering if there’s a way to do that.
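
For reference, the error comes from scikit-learn, whose precision and recall functions default to average='binary', so the setting has to reach that underlying call. The sketch below is a plain-Python illustration (not the evaluate API) of what 'micro', 'macro', and 'weighted' compute for the example above. The helper names per_class_counts and precision are hypothetical, and a class with no predictions is assumed to score a precision of 0.

```python
from collections import Counter

def per_class_counts(references, predictions):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    pairs = list(zip(references, predictions))
    counts = {}
    for label in sorted(set(references) | set(predictions)):
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        counts[label] = (tp, fp, fn)
    return counts

def precision(references, predictions, average):
    """Multiclass precision under one of three averaging strategies."""
    counts = per_class_counts(references, predictions)
    if average == "micro":
        # Pool true/false positives over all classes, then divide once.
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        return tp / (tp + fp) if tp + fp else 0.0
    # Per-class precision; undefined values (no predictions) count as 0.
    per_class = {
        label: (tp / (tp + fp) if tp + fp else 0.0)
        for label, (tp, fp, _fn) in counts.items()
    }
    if average == "macro":
        # Unweighted mean over classes.
        return sum(per_class.values()) / len(per_class)
    if average == "weighted":
        # Mean over classes, weighted by the number of true instances.
        support = Counter(references)
        return sum(per_class[l] * support[l] for l in per_class) / len(references)
    raise ValueError(f"unsupported average: {average!r}")

references = [2, 2, 1, 0]
predictions = [2, 1, 1, 2]
for mode in ("micro", "macro", "weighted"):
    print(mode, precision(references, predictions, mode))
```

For these four samples, the weighted result is 0.375: the per-class precisions are 0, 0.5, and 0.5, with supports 1, 1, and 2.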

On a related note, regarding the evaluator approach: is it possible to combine metrics when calling, for example, evaluate.load('text-classification').compute(...), and/or to pass the averaging strategy needed for multiclass problems?

Best regards and thanks for the great work.

Indeed, passing additional kwargs is an issue at the moment. This PR should help make it easier: Refactor kwargs and configs by lvwerra · Pull Request #188 · huggingface/evaluate · GitHub

Once that PR is in, instead of passing the settings during compute you will be able to pass them when loading a metric. E.g. the following would then work:

metrics = evaluate.combine([
    evaluate.load("precision", average="weighted"),
    evaluate.load("recall", average="weighted")
])

And this would then also be compatible with the evaluator. Hope we can finish this in the next week or so.

Hi @lvwerra, thanks for your reply. I’ll take a look and try it when it’s available.

@lvwerra I’m excited to give evaluate a try! I am getting the same error as above:

Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
  File "/home/aclifton/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--f1/0ca73f6cf92ef5a268320c697f7b940d1030f8471714bffdb6856c641b818974/f1.py", line 127, in _compute
    score = f1_score(
  File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 306, in run_training_pipeline
    eval_results = clf_metrics.compute()
  File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 445, in <module>
    run_training_pipeline(tmp_dict)

with the following set up:

import evaluate
import torch

all_preds = torch.tensor((), device='cpu')
preds_labels = torch.tensor((), device='cpu')
clf_metrics = evaluate.combine([
    evaluate.load('accuracy'), 
    evaluate.load('f1', average='macro'), 
    evaluate.load('precision', average='macro'), 
    evaluate.load('recall', average='macro')
    ])


for batch in eval_dataloader:
    batch = {k: v.to('cpu') for k, v in batch.items()}
    with torch.no_grad():
        outputs = my_model(**batch)
    
    logits = outputs['logits']
    predictions = torch.argmax(logits, dim=-1)
    all_preds = torch.cat((all_preds, predictions))
    preds_labels = torch.cat((preds_labels, batch['labelz']))
    clf_metrics.add_batch(predictions=predictions, references=batch['labelz'])
        
eval_results = clf_metrics.compute()

I’m using:
evaluate: 0.2.2
python: 3.9.7

Perhaps I misunderstood your reply above, but I assumed that the evaluate.load('f1', average='macro') would have worked.

The week is not over :stuck_out_tongue:. I am still working on the PR as it requires quite a lot of changes. Once it is merged, it should work.

@lvwerra my apologies. I completely misread the reply. Thank you for your hard work and I look forward to the update!

@lvwerra It seems that even though the PR was merged and released, specifying kwargs (at least average) when loading a metric is still not working. Here’s what I tried:

metric = evaluate.load('f1', average='macro')
metric.compute(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2])

It still raises the Target is multiclass but average='binary' error. I can reproduce the error in this notebook.

This is very frustrating as it still doesn’t work on transformers 4.25.1 with evaluate 0.3.0 :slight_smile:

@lvwerra

def compute_metrics (eval_pred):
    #metric = evaluate.load("f1")
    metric = evaluate.combine([
        evaluate.load("f1", average="micro"),
        evaluate.load("precision", average="micro"),
        evaluate.load("recall", average="micro")
    ])
    logits, labels = eval_pred
    preds = np.argmax(logits, axis = -1)
    return metric.compute(predictions=preds, references = labels)

will fail with the same error

def compute_metrics (eval_pred):
    #metric = evaluate.load("f1")
    metric = evaluate.combine([
        evaluate.load("f1", average="micro"),
        evaluate.load("precision", average="micro"),
        evaluate.load("recall", average="micro")
    ])
    logits, labels = eval_pred
    preds = np.argmax(logits, axis = -1)
    return metric.compute(predictions=preds, references = labels, average="micro")

will fail with the same error

def compute_metrics (eval_pred):
    metric = evaluate.load("f1", average="micro")
    logits, labels = eval_pred
    preds = np.argmax(logits, axis = -1)
    return metric.compute(predictions=preds, references = labels)

will fail


def compute_metrics (eval_pred):
    metric = evaluate.load("f1")
    logits, labels = eval_pred
    preds = np.argmax(logits, axis = -1)
    return metric.compute(predictions=preds, references = labels, average="micro")

will work, but this is just for a single metric; I need multiple metrics.

Same problem. I’m trying to use it in an image classification pipeline like below, but it still gives the error.

f1_metric = load("f1", "multiclass", average = 'macro')
accuracy_metric = load('accuracy')
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=test_ds,
    metric=combine([accuracy_metric, f1_metric]),
    label_mapping=pipe.model.config.label2id,
)

Hi @lvwerra, it would be great if you could reply to this, as I can see other questions are also being ignored. If you are not the right person, would you please let us know whom we should talk to?

Thank you

Hi @Haonan, unfortunately, we had to put that feature on hold for a bit, due to an issue with backwards compatibility.

You could just load all the metrics independently and call each one inside the compute_metrics function:

import evaluate
import numpy as np

f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")

def compute_metrics (eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis = -1)
    
    results = {}
    results.update(f1_metric.compute(predictions=preds, references = labels, average="micro"))
    results.update(recall_metric.compute(predictions=preds, references = labels, average="micro"))
    return results

Note that I would load the metrics outside the function; otherwise they will be loaded at each evaluation step.

To use them with combine and in the evaluator, we probably need to wrap the metric in a class.


class ConfiguredMetric:
    def __init__(self, metric, *metric_args, **metric_kwargs):
        self.metric = metric
        self.metric_args = metric_args
        self.metric_kwargs = metric_kwargs
    
    def add(self, *args, **kwargs):
        return self.metric.add(*args, **kwargs)
    
    def add_batch(self, *args, **kwargs):
        return self.metric.add_batch(*args, **kwargs)

    def compute(self, *args, **kwargs):
        return self.metric.compute(*args, *self.metric_args, **kwargs, **self.metric_kwargs)

    @property
    def name(self):
        return self.metric.name

    def _feature_names(self):
        return self.metric._feature_names()

With that you should be able to do the following:

evaluate.combine([
    evaluate.load('accuracy'), 
    ConfiguredMetric(evaluate.load('f1'), average='macro')
])

Maybe this needs some tweaking but that should roughly be a temporary solution.

Big thanks lvwerra, will try it out, cheers

Thank you for this walkthrough! Unfortunately, after implementing this code for trainer.evaluate(), I’m getting another error:

_compute() got an unexpected keyword argument 'average'

Do you know what might be causing this?

Hi @lvwerra,

I’m getting the same error as @Fredi.

I’m using Python 3.8.14 with:
Transformers==4.26.0
Evaluate==0.4.0

My code looks like this:

import evaluate
import numpy as np


f1_metric = evaluate.load("data/metrics/f1")
precision_metric = evaluate.load("data/metrics/precision")
recall_metric = evaluate.load("data/metrics/recall")
accuracy_metric = evaluate.load("data/metrics/accuracy")

def ccompute_metrics(preds, labels):
    results = {}
    results.update(f1_metric.compute(predictions=preds, references = labels, average="micro"))
    results.update(precision_metric.compute(predictions=preds, references = labels, average="micro"))
    results.update(recall_metric.compute(predictions=preds, references = labels, average="micro"))
    results.update(accuracy_metric.compute(predictions=preds, references = labels, average="micro"))
    return results

predictions = [1, 2, 4]
references = [1, 5, 4]

print(ccompute_metrics(predictions, references))

Any idea what I’m doing wrong here?

Could you check if all the metrics actually support the average keyword?

I practically gave up and just decided to take out the prediction and used sklearn!

eval_pred = self.trainer.predict(test_dataset=eval_dataset)

logits, labels = eval_pred[:2]
predictions = np.argmax(logits, axis=-1)
classification_report(y_true=labels, y_pred=predictions)

This is doing it for me!

I have tried with “weighted” with no luck!

You’re right, accuracy does not support the average keyword. It does work now, thanks!
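
The failure above came from forwarding average to a metric whose compute function does not take it. One way to guard against that is to forward only the keyword arguments each metric's signature accepts. The sketch below uses stand-in functions rather than real evaluate metrics; fake_accuracy, fake_recall, and compute_all are hypothetical names used only for illustration.

```python
import inspect

def fake_accuracy(predictions, references):
    """Stand-in for a metric that has no `average` parameter."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

def fake_recall(predictions, references, average="binary"):
    """Stand-in for a metric that accepts `average` (micro only, for brevity)."""
    if average != "micro":
        raise ValueError("this sketch only implements average='micro'")
    # For single-label multiclass data, micro recall is pooled TP / (TP + FN),
    # which is the fraction of samples predicted correctly.
    tp = sum(p == r for p, r in zip(predictions, references))
    return {"recall": tp / len(references)}

def compute_all(metrics, predictions, references, **extra_kwargs):
    """Call every metric, passing only the extra kwargs its signature accepts."""
    results = {}
    for metric in metrics:
        accepted = inspect.signature(metric).parameters
        kwargs = {k: v for k, v in extra_kwargs.items() if k in accepted}
        results.update(metric(predictions=predictions, references=references, **kwargs))
    return results

scores = compute_all(
    [fake_accuracy, fake_recall],
    predictions=[1, 2, 4],
    references=[1, 5, 4],
    average="micro",
)
print(scores)
```

With real evaluate metrics, the simpler fix confirmed in this thread is to leave average out of the accuracy call.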

Hey @lvwerra
I am having a similar problem but none of the above solutions worked for me!

This is my code:

metric = evaluate.load("f1")

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

final_score = metric.compute()

I also get the error saying that my target is multiclass while the metric’s default setting is for the binary case, and that I need to set ‘average’. I tried several different ways of adding the ‘average’ argument, but I only got errors.

How can my code be modified?
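
As reported earlier in the thread, passing average directly to compute() together with the predictions and references works for a single metric. A minimal sketch of that pattern for an evaluation loop is to collect predictions and labels across batches and then make one call with an explicit averaging mode. To stay self-contained, the sketch uses a hand-written macro_f1 (a hypothetical helper, not part of evaluate) and hard-coded batches in place of a model and dataloader.

```python
def macro_f1(references, predictions):
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    pairs = list(zip(references, predictions))
    scores = []
    for label in sorted(set(references) | set(predictions)):
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hard-coded stand-ins for the per-batch model outputs and labels.
batches = [
    {"predictions": [2, 1], "labels": [2, 2]},
    {"predictions": [1, 2], "labels": [1, 0]},
]

all_predictions, all_labels = [], []
for batch in batches:
    # In a real loop these would be the argmax of the logits and the batch labels.
    all_predictions.extend(batch["predictions"])
    all_labels.extend(batch["labels"])

# One final call on the accumulated data, with the averaging mode made explicit.
final_score = macro_f1(all_labels, all_predictions)
print(final_score)
```

With evaluate, the equivalent final step would be f1_metric.compute(predictions=all_predictions, references=all_labels, average="macro"), which is the call form reported to work above.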

If you want to use the Evaluator class with combined metrics and custom METRIC_KWARGS, you can create a custom subclass of the appropriate Evaluator subclass for your task and override the METRIC_KWARGS attribute with your desired values. Here’s an example that shows how to do this for the text classification task:


import evaluate
from evaluate import evaluator
from transformers import pipeline

# Create a custom subclass of TextClassificationEvaluator
class CustomTextClassificationEvaluator(evaluator(task="text-classification").__class__):
    METRIC_KWARGS = {"average": "weighted"}

# Instantiate the custom evaluator
task_evaluator = CustomTextClassificationEvaluator()

# Load the desired metrics
f1_metric = evaluate.load('f1')
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

# Create a pipeline for the text classification task
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

# Compute the evaluation results using the custom evaluator and combined metrics
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    input_column='text',
    metric=evaluate.combine([f1_metric, precision_metric, recall_metric]),
    data=dataset['train'].select(range(5)),
    label_mapping=label2id
)

In this example, we create a custom subclass of TextClassificationEvaluator called CustomTextClassificationEvaluator and override the METRIC_KWARGS attribute with a dictionary containing the custom argument "average" and its value "weighted". Then, we instantiate our custom evaluator, load the desired metrics, create a pipeline for the text classification task, and compute the evaluation results using our custom evaluator and combined metrics.

Note that this approach only works with metrics that support the "average" argument. If you try to use it with metrics that don’t support this argument, you may encounter errors.

Edit:

I forgot to mention that I’m using evaluate version 0.4.1.dev0

The instructions to install can be found here: