Combining metrics for multiclass prediction evaluation

I’ve started playing with the evaluate library, following the quick tour, on a multiclass classification problem, and I have a few questions. I’d like to use, for example, metrics = evaluate.combine(['precision', 'recall']), but when I call metrics.compute(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2], average='weighted'), the average argument does not seem to be passed through:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
A related question about the evaluator approach: is it possible to combine metrics when calling, for example, evaluate.evaluator('text-classification').compute(...), and/or to pass the averaging strategy needed for multiclass problems?
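For reference, here is what the 'macro' and 'weighted' settings listed in the error message compute. This is a minimal plain-Python sketch, not the library's implementation; the helper names per_class_stats and averaged are hypothetical, and classes with no predictions are assumed to score a precision of 0.

```python
from collections import Counter


def per_class_stats(references, predictions):
    """Return {label: (precision, recall, support)} for every label seen."""
    labels = sorted(set(references) | set(predictions))
    support = Counter(references)      # how often each label is the true class
    predicted = Counter(predictions)   # how often each label is predicted
    true_pos = Counter(r for r, p in zip(references, predictions) if r == p)
    stats = {}
    for label in labels:
        tp = true_pos[label]
        # Assumption: an undefined ratio (empty denominator) counts as 0.0.
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label] if support[label] else 0.0
        stats[label] = (precision, recall, support[label])
    return stats


def averaged(stats, average):
    """Combine per-class scores: 'macro' weights classes equally,
    'weighted' weights each class by its number of true samples."""
    if average == "macro":
        weights = {label: 1 for label in stats}
    elif average == "weighted":
        weights = {label: s[2] for label, s in stats.items()}
    else:
        raise ValueError(f"unsupported average: {average!r}")
    total = sum(weights.values())
    precision = sum(weights[l] * stats[l][0] for l in stats) / total
    recall = sum(weights[l] * stats[l][1] for l in stats) / total
    return {"precision": precision, "recall": recall}


# Same toy data as in the compute call above.
stats = per_class_stats(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2])
weighted = averaged(stats, "weighted")
macro = averaged(stats, "macro")
print(weighted, macro)
```

With this data, class 2 has twice the support of the other classes, so it counts double under 'weighted' but equally under 'macro'.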

Indeed, passing additional kwargs is an issue at the moment. This PR should help make it easier: Refactor kwargs and configs by lvwerra · Pull Request #188 · huggingface/evaluate · GitHub

With that PR, instead of passing the settings during compute, you will be able to pass them when loading a metric. E.g. the following would then work:

metrics = evaluate.combine([
    evaluate.load("precision", average="weighted"),
    evaluate.load("recall", average="weighted")
])

And this would then also be compatible with the evaluator. Hope we can finish this in the next week or so.

Hi @lvwerra, thanks for your reply. I’ll take a look and try it when it’s available.

@lvwerra I’m excited to give evaluate a try! I am getting the same error as above:

Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
  File "/home/aclifton/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--f1/0ca73f6cf92ef5a268320c697f7b940d1030f8471714bffdb6856c641b818974/", line 127, in _compute
    score = f1_score(
  File "/home/aclifton/rf_fp/", line 306, in run_training_pipeline
    eval_results = clf_metrics.compute()
  File "/home/aclifton/rf_fp/", line 445, in <module>

with the following set up:

import evaluate
import torch

all_preds = torch.tensor((), device='cpu')
preds_labels = torch.tensor((), device='cpu')
clf_metrics = evaluate.combine([
    evaluate.load('f1', average='macro'),
    evaluate.load('precision', average='macro'),
    evaluate.load('recall', average='macro')
])

for batch in eval_dataloader:
    batch = {k: v.to('cpu') for k, v in batch.items()}
    with torch.no_grad():
        outputs = my_model(**batch)
    logits = outputs['logits']
    predictions = torch.argmax(logits, dim=-1)
    all_preds = torch.cat((all_preds, predictions))
    preds_labels = torch.cat((preds_labels, batch['labelz']))
    clf_metrics.add_batch(predictions=predictions, references=batch['labelz'])
eval_results = clf_metrics.compute()

I’m using:
evaluate: 0.2.2
python: 3.9.7

Perhaps I misunderstood your reply above, but I assumed that the evaluate.load('f1', average='macro') would have worked.

The week is not over :stuck_out_tongue:. I am still working on the PR, as it requires quite a lot of changes. Once it is merged, it should work.
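Until then, a possible stopgap is to call scikit-learn directly, since the traceback above shows the f1 metric delegating to f1_score. A minimal sketch, assuming scikit-learn is installed and the accumulated labels have been converted to plain lists; the helper name macro_scores and the sample values are illustrative only:

```python
from sklearn.metrics import precision_recall_fscore_support


def macro_scores(references, predictions):
    """Macro-averaged precision, recall and F1 computed with scikit-learn."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        references,
        predictions,
        average="macro",
        zero_division=0,  # assumption: score classes that are never predicted as 0
    )
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}


# Illustrative values; in the loop above these would be the accumulated
# labels and predictions, e.g. preds_labels.tolist() and all_preds.tolist().
scores = macro_scores(references=[2, 2, 1, 0], predictions=[2, 1, 1, 2])
print(scores)
```

This bypasses evaluate.combine entirely, so it is only useful while the average setting cannot be passed through.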

@lvwerra my apologies. I completely misread the reply. Thank you for your hard work and I look forward to the update!

