Combining metrics for multiclass prediction evaluation

Good afternoon:

I’ve started playing with the evaluate library following the quick tour, applying it to a multiclass classification problem, and I have a few questions. I’d like to use, for example, metrics = evaluate.combine(['precision', 'recall']), but when calling metrics.compute(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2], average='weighted') the average argument doesn’t seem to be picked up:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
Not a big concern, as I can compute the metrics individually (see the sketch below), but I was wondering whether there’s a way to do this with combine.
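
For reference, a minimal sketch of the individual-metric workaround I mean (it reuses the toy labels above; average='weighted' is accepted when each metric is computed on its own):

import evaluate

precision = evaluate.load('precision')
recall = evaluate.load('recall')

refs = [2, 2, 1, 0]
preds = [2, 1, 1, 2]

# Passing average='weighted' works when each metric is computed separately.
results = {}
results.update(precision.compute(references=refs, predictions=preds, average='weighted'))
results.update(recall.compute(references=refs, predictions=preds, average='weighted'))
print(results)  # {'precision': ..., 'recall': ...}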

On the other hand, same topic but regarding the evaluator approach: is it possible to combine metrics when calling, for example, evaluate.evaluator('text-classification').compute(...), and/or to pass the needed average strategy for multiclass classification problems?

Best regards and thanks for the great work.

Indeed, passing additional kwargs is an issue at the moment. This PR should help make it easier: Refactor kwargs and configs by lvwerra · Pull Request #188 · huggingface/evaluate · GitHub

Instead of passing the settings during compute, you would then be able to pass them when loading a metric. E.g. the following would work:

metrics = evaluate.combine([
    evaluate.load("precision", average="weighted"),
    evaluate.load("recall", average="weighted")
])

And this would then also be compatible with the evaluator. Hope we can finish this in the next week or so.
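
As a hedged sketch of what that evaluator compatibility could look like once load-time kwargs are supported (not the current API; pipe and test_ds are placeholders for an existing text-classification pipeline and evaluation dataset):

import evaluate

task_evaluator = evaluate.evaluator("text-classification")

# Requires the load-time kwargs from the PR above.
clf_metrics = evaluate.combine([
    evaluate.load("precision", average="weighted"),
    evaluate.load("recall", average="weighted"),
])

results = task_evaluator.compute(
    model_or_pipeline=pipe,   # assumed: an existing transformers pipeline
    data=test_ds,             # assumed: an existing evaluation dataset
    metric=clf_metrics,
    label_mapping=pipe.model.config.label2id,
)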

Hi @lvwerra, thanks for your reply. I’ll take a look and try it when it’s available.

@lvwerra I’m excited to give evaluate a try! I am getting the same error as above:

Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 445, in <module>
    run_training_pipeline(tmp_dict)
  File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 306, in run_training_pipeline
    eval_results = clf_metrics.compute()
  File "/home/aclifton/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--f1/0ca73f6cf92ef5a268320c697f7b940d1030f8471714bffdb6856c641b818974/f1.py", line 127, in _compute
    score = f1_score(
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

with the following setup:

import evaluate
import torch

all_preds = torch.tensor((), device='cpu')
preds_labels = torch.tensor((), device='cpu')
clf_metrics = evaluate.combine([
    evaluate.load('accuracy'), 
    evaluate.load('f1', average='macro'), 
    evaluate.load('precision', average='macro'), 
    evaluate.load('recall', average='macro')
    ])


for batch in eval_dataloader:
    batch = {k: v.to('cpu') for k, v in batch.items()}
    with torch.no_grad():
        outputs = my_model(**batch)
    
    logits = outputs['logits']
    predictions = torch.argmax(logits, dim=-1)
    all_preds = torch.cat((all_preds, predictions))
    preds_labels = torch.cat((preds_labels, batch['labelz']))
    clf_metrics.add_batch(predictions=predictions, references=batch['labelz'])
        
eval_results = clf_metrics.compute()

I’m using:
evaluate: 0.2.2
python: 3.9.7

Perhaps I misunderstood your reply above, but I assumed that evaluate.load('f1', average='macro') would have worked.

The week is not over :stuck_out_tongue:. I am still working on the PR as it requires quite a lot of changes. Once it is merged, it should work.

@lvwerra my apologies. I completely misread the reply. Thank you for your hard work and I look forward to the update!

@lvwerra It seems that even though the PR was merged and released, specifying kwargs (at least average) when loading a metric is still not working. Here’s what I tried:

metric = evaluate.load('f1', average='macro')
metric.compute(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2])

It still raises the Target is multiclass but average='binary' error. I can reproduce the error in this notebook.

This is very frustrating, as it still doesn’t work with transformers 4.25.1 and evaluate 0.3.0 :slight_smile:

@lvwerra

def compute_metrics(eval_pred):
    # metric = evaluate.load("f1")
    metric = evaluate.combine([
        evaluate.load("f1", average="micro"),
        evaluate.load("precision", average="micro"),
        evaluate.load("recall", average="micro")
    ])
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

will fail with the same error

def compute_metrics(eval_pred):
    # metric = evaluate.load("f1")
    metric = evaluate.combine([
        evaluate.load("f1", average="micro"),
        evaluate.load("precision", average="micro"),
        evaluate.load("recall", average="micro")
    ])
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels, average="micro")

will fail with the same error

def compute_metrics(eval_pred):
    metric = evaluate.load("f1", average="micro")
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

will fail


def compute_metrics(eval_pred):
    metric = evaluate.load("f1")
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels, average="micro")

will work, but this covers only a single metric and I need multiple metrics.

Same problem here.
I’m trying to use it in an image classification pipeline as below, but it still raises the error.

from evaluate import combine, evaluator, load

task_evaluator = evaluator("image-classification")

f1_metric = load("f1", "multiclass", average="macro")
accuracy_metric = load("accuracy")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=test_ds,
    metric=combine([accuracy_metric, f1_metric]),
    label_mapping=pipe.model.config.label2id,
)

Hi @lvwerra, it would be great if you could reply to this, as I can see other questions have also gone unanswered. If you are not the right person, could you please let us know whom we should talk to?

Thank you

Hi @Haonan, unfortunately, we had to put that feature on hold for a bit, due to an issue with backwards compatibility.

You could just load the metrics independently and call them inside the compute_metrics function:

import evaluate
import numpy as np

f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    results = {}
    results.update(f1_metric.compute(predictions=preds, references=labels, average="micro"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="micro"))
    return results

Note that I load the metrics outside the function, otherwise they would be reloaded at every evaluation step.
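
As a usage sketch (assuming a standard transformers Trainer setup; model, training_args, train_ds, and eval_ds are hypothetical placeholders for your own objects), the function above is simply passed as compute_metrics:

from transformers import Trainer

# Hypothetical wiring; model, training_args, train_ds, and eval_ds are assumed
# to be defined elsewhere in your training script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.evaluate()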

To use them with combine and in the evaluator, we probably need to wrap the metric in a class:


class ConfiguredMetric:
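    # Thin wrapper that stores extra metric arguments (e.g. average='macro')
    # and forwards them to every compute() call, so a configured metric can be
    # passed to evaluate.combine.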
    def __init__(self, metric, *metric_args, **metric_kwargs):
        self.metric = metric
        self.metric_args = metric_args
        self.metric_kwargs = metric_kwargs
    
    def add(self, *args, **kwargs):
        return self.metric.add(*args, **kwargs)
    
    def add_batch(self, *args, **kwargs):
        return self.metric.add_batch(*args, **kwargs)

    def compute(self, *args, **kwargs):
        return self.metric.compute(*args, *self.metric_args, **kwargs, **self.metric_kwargs)

    @property
    def name(self):
        return self.metric.name

    def _feature_names(self):
        return self.metric._feature_names()

With that you should be able to do the following:

evaluate.combine([
    evaluate.load('accuracy'), 
    ConfiguredMetric(evaluate.load('f1'), average='macro')
])

Maybe this needs some tweaking, but it should roughly work as a temporary solution.
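
For reference, a rough usage sketch of that workaround with the toy labels from earlier in the thread (as noted above, the wrapper may still need some tweaking):

clf_metrics = evaluate.combine([
    evaluate.load('accuracy'),
    ConfiguredMetric(evaluate.load('f1'), average='macro')
])

# The average='macro' stored in the wrapper is forwarded to f1's compute().
results = clf_metrics.compute(predictions=[2, 1, 1, 2], references=[2, 2, 1, 0])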

Big thanks @lvwerra, I will try it out, cheers.

Thank you for this walkthrough! Unfortunately, after implementing this code for trainer.evaluate(), I’m still getting another error:

_compute() got an unexpected keyword argument 'average'

Do you know what might be causing this?