Combining metrics for multiclass prediction evaluation

Good afternoon:

I’ve started playing with the evaluate library, following the quick tour, on a multiclass classification problem, and I have a few doubts. I’d like to use, for example, metrics = evaluate.combine(['precision', 'recall']), but when I call metrics.compute(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2], average='weighted') the average argument does not seem to be passed through:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
This is not a big concern, as I can compute the metrics individually, but I was wondering whether there’s a way to do it with combine.
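
For context, here is a minimal sketch of what the average settings listed in that error message do for multiclass labels. It is plain Python written only for illustration, not the library's implementation, and the function name precision_recall is made up for this example.

```python
from collections import Counter


def precision_recall(references, predictions, average="macro"):
    """Return (precision, recall) averaged as 'micro', 'macro' or 'weighted'."""
    labels = sorted(set(references) | set(predictions))
    support = Counter(references)      # true examples per class
    predicted = Counter(predictions)   # predictions per class
    true_pos = Counter(r for r, p in zip(references, predictions) if r == p)

    if average == "micro":
        # Pool the counts over all classes before dividing.
        tp = sum(true_pos.values())
        return tp / len(predictions), tp / len(references)

    # Per-class scores; a class with no predictions (or no examples) scores 0.
    prec = [true_pos[c] / predicted[c] if predicted[c] else 0.0 for c in labels]
    rec = [true_pos[c] / support[c] if support[c] else 0.0 for c in labels]

    if average == "macro":
        weights = [1.0] * len(labels)            # every class counts equally
    elif average == "weighted":
        weights = [support[c] for c in labels]   # classes weighted by support
    else:
        raise ValueError(f"unsupported average: {average!r}")

    total = sum(weights)
    return (
        sum(w * p for w, p in zip(weights, prec)) / total,
        sum(w * r for w, r in zip(weights, rec)) / total,
    )


print(precision_recall([2, 2, 1, 0], [2, 1, 1, 2], average="weighted"))
```

With the inputs from the call above, the weighted precision is 0.375 and the weighted recall is 0.5. The default average='binary' has no meaning with three classes, which is why the error is raised when the argument is not forwarded.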

On a related note, regarding the evaluator approach: is it possible to combine metrics when calling, for example, evaluate.evaluator('text-classification').compute(...), and/or to pass the averaging strategy needed for multiclass problems?

Best regards and thanks for the great work.

Indeed, passing additional kwargs is an issue at the moment. This PR should help make it easier: Refactor kwargs and configs by lvwerra · Pull Request #188 · huggingface/evaluate · GitHub

Once that PR is merged, you will be able to pass the settings when loading a metric instead of during compute. E.g. the following would then work:

metrics = evaluate.combine([
    evaluate.load("precision", average="weighted"),
    evaluate.load("recall", average="weighted"),
])

This would then also be compatible with the evaluator. I hope we can finish this in the next week or so.

Hi @lvwerra, thanks for your reply. I’ll take a look and try it when it’s available.

@lvwerra I’m excited to give evaluate a try! I am getting the same error as above:

Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
  File "/home/aclifton/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--f1/0ca73f6cf92ef5a268320c697f7b940d1030f8471714bffdb6856c641b818974/", line 127, in _compute
    score = f1_score(
  File "/home/aclifton/rf_fp/", line 306, in run_training_pipeline
    eval_results = clf_metrics.compute()
  File "/home/aclifton/rf_fp/", line 445, in <module>

with the following setup:

import evaluate
import torch

# Accumulators for all predictions and labels across batches
all_preds = torch.tensor((), device='cpu')
preds_labels = torch.tensor((), device='cpu')
clf_metrics = evaluate.combine([
    evaluate.load('f1', average='macro'),
    evaluate.load('precision', average='macro'),
    evaluate.load('recall', average='macro')
])

for batch in eval_dataloader:
    # Move every tensor in the batch to the CPU
    batch = {k: v.to('cpu') for k, v in batch.items()}
    with torch.no_grad():
        outputs = my_model(**batch)
    logits = outputs['logits']
    predictions = torch.argmax(logits, dim=-1)
    all_preds = torch.cat((all_preds, predictions))
    preds_labels = torch.cat((preds_labels, batch['labelz']))
    clf_metrics.add_batch(predictions=predictions, references=batch['labelz'])
eval_results = clf_metrics.compute()

I’m using:
evaluate: 0.2.2
python: 3.9.7

Perhaps I misunderstood your reply above, but I assumed that the evaluate.load('f1', average='macro') would have worked.

The week is not over :stuck_out_tongue:. I am still working on the PR, as it requires quite a lot of changes. Once it is merged, it should work.
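
In the meantime, the workaround mentioned in the first post (computing each metric individually) could be wrapped roughly as below. This is only a sketch: compute_separately is a hypothetical helper, not part of evaluate, and it assumes each metric object accepts the extra keyword arguments (such as average) in its own compute() call.

```python
def compute_separately(metrics, predictions, references, **compute_kwargs):
    """Call compute() on each metric with the same kwargs and merge the results."""
    results = {}
    for metric in metrics:
        # Each metric returns a dict such as {"precision": 0.375}.
        results.update(
            metric.compute(
                predictions=predictions,
                references=references,
                **compute_kwargs,
            )
        )
    return results


# Usage (needs the evaluate package and access to the Hub to load the metrics):
# import evaluate
# metrics = [evaluate.load("precision"), evaluate.load("recall")]
# compute_separately(
#     metrics,
#     predictions=[2, 1, 1, 2],
#     references=[2, 2, 1, 0],
#     average="weighted",
# )
```

For a batched evaluation loop, the accumulated predictions and labels can be passed to this helper once after the loop, instead of calling add_batch on a combined metric.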


@lvwerra my apologies. I completely misread the reply. Thank you for your hard work and I look forward to the update!
