Get multiple metrics when using the Hugging Face Trainer

Hi all, I’d like to ask if there is any way to get multiple metrics while fine-tuning a model. I’m currently training a model for the GLUE STS task, so I’ve been trying to get the Pearson correlation and F1 score as evaluation metrics. I followed the linked discussion (Log multiple metrics while training) to achieve this, but in the middle of the second training epoch it gave me the following error:

Trainer is attempting to log a value of "{'pearsonr': 0.8609849499038021}" of type <class 'dict'> for key "eval/pearsonr" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.8307692307692308}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-3435b262f1ae> in <module>()
----> 1 trainer.train()

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
   1724                 self.state.best_metric is None
   1725                 or self.state.best_model_checkpoint is None
-> 1726                 or operator(metric_value, self.state.best_metric)
   1727             ):
   1728                 self.state.best_metric = metric_value

TypeError: '>' not supported between instances of 'dict' and 'dict'

And this is my compute_metrics code snippet:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions[:, 0]
    binary_predictions = [1.0 if prediction >= 3.0 else 0.0 for prediction in predictions]
    binary_labels = [1.0 if label >= 3.0 else 0.0 for label in labels]
    pr = metric_pearsonr.compute(predictions=predictions, references=labels)
    f1 = metric_f1.compute(predictions=binary_predictions, references=binary_labels)

    return {"pearsonr": pr, "f1": f1} 

It works fine if I return only one of the metrics, e.g. return pr or return f1. Does anyone have suggestions about this issue? I’d really appreciate it.

The first line of your error message indicates that the Trainer expects a scalar instead of a dictionary (Trainer is attempting to log a value of "{'pearsonr': 0.8609849499038021}" of type <class 'dict'> for key "eval/pearsonr" as a scalar.).

Can you share why you want to return the values in a dictionary and not as values (i.e. why not use return pr, f1)?
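Alternatively, since the Trainer needs compute_metrics to return a flat dict of scalars, and each metric's compute() itself returns a dict, unpacking those dicts should work. A minimal sketch, with plain dicts standing in for the compute() results from the snippet above:

```python
# Stand-ins for what metric_pearsonr.compute() / metric_f1.compute() return
pr = {"pearsonr": 0.8609849499038021}
f1 = {"f1": 0.8307692307692308}

# Nested dicts are what triggered the logging warning:
nested = {"pearsonr": pr, "f1": f1}

# Flat scalars are what the Trainer can log and compare:
flat = {"pearsonr": pr["pearsonr"], "f1": f1["f1"]}
# equivalently: flat = {**pr, **f1}
print(flat)  # {'pearsonr': 0.8609849499038021, 'f1': 0.8307692307692308}
```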

Hi, thanks for the reply. There’s no particular reason I used a dictionary; I just followed the approach in this discussion (Log multiple metrics while training). I also tried return pr, f1 as you suggested, but it showed me another error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-3435b262f1ae> in <module>()
----> 1 trainer.train()

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   2511 
   2512         if all_losses is not None:
-> 2513             metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
   2514 
   2515         # Prefix all keys with metric_key_prefix + '_'

TypeError: 'tuple' object does not support item assignment
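Looking at the traceback, return pr, f1 fails because the Trainer assigns the evaluation loss into the object returned by compute_metrics, assuming it is a dict; a tuple rejects that item assignment. A minimal reproduction of the failing line:

```python
# Reproduces trainer.py's
#   metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
metrics = (0.8609, 0.8307)       # what `return pr, f1` produces: a tuple
try:
    metrics["eval_loss"] = 0.25  # the Trainer's item assignment
except TypeError as err:
    print(err)  # 'tuple' object does not support item assignment
```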

I don’t know if this helps, but I have implemented multiple metrics without using the Trainer, by modifying the evaluation loop like this:

from datasets import load_metric

accuracy = load_metric("accuracy")
precision = load_metric("precision")
recall = load_metric("recall")
f1 = load_metric("f1")

metrics = [accuracy, precision, recall, f1]

model.eval()
for step, batch in enumerate(eval_dataloader):
    outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1) if not is_regression else outputs.logits.squeeze()
    for metric in metrics:
        metric.add_batch(
            predictions=accelerator.gather(predictions),
            references=accelerator.gather(batch["labels"]),
        )

logger.info(f"epoch {epoch+1}: train loss {loss}")
for metric in metrics:
    if metric.name == "accuracy":
        eval_metric = metric.compute()
    else:
        # average=None returns one score per class
        eval_metric = metric.compute(average=None)
    logger.info(f"{eval_metric}")
    if metric.name == "f1":
        # macro-average the per-class F1 scores
        avg_f1 = sum(eval_metric["f1"]) / len(eval_metric["f1"])
        logger.info(f"Average f1: {avg_f1}")