Combine multiple metrics in compute_metrics for validation

Hi
I am trying to get precision / recall / accuracy scores computed when the model is evaluated during training. My compute_metrics() function is as follows:

import evaluate
import numpy as np

metric1 = evaluate.load("accuracy")
metric2 = evaluate.load("precision")
metric3 = evaluate.load("recall")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # logits -> predicted class ids
    predictions = np.argmax(predictions, axis=1)

    accuracy = metric1.compute(predictions=predictions, references=labels)["accuracy"]
    precision = metric2.compute(predictions=predictions, references=labels,
                                average="micro")["precision"]
    recall = metric3.compute(predictions=predictions, references=labels,
                             average="micro")["recall"]

    return {"precision": precision, "recall": recall, "accuracy": accuracy}

The TrainingArguments and Trainer are as follows:

training_args = TrainingArguments(
    output_dir=str(local_model_pth / model_name),
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=20000,
    save_strategy="steps",
    load_best_model_at_end=True,
    push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Everything seems to run fine, but the output I get while the model is training looks like this:

[screenshot of the training log]

Precision, recall, and accuracy all have exactly the same value. I have been through the help here, the Internet, and HuggingChat without finding a working solution.
Any idea?


average="micro" turned out to be a no-go: with micro averaging on a single-label multi-class problem, precision, recall, and accuracy all reduce to the same global fraction of correct predictions, which is exactly why the three values above were identical (a quick check follows the snippet below). I also discovered evaluate.combine in the Evaluate library docs. The implementation:

import evaluate
import numpy as np

metric = evaluate.combine(["precision", "recall"])

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # average="macro" gives the unweighted mean over classes
    return metric.compute(predictions=predictions, references=labels, average="macro")
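
Quick check of the "identical values" symptom, with toy labels that are not from the actual dataset (a minimal sketch, just to show that micro-averaged precision and recall equal accuracy in the multi-class single-label case):

import evaluate

prec = evaluate.load("precision")
rec = evaluate.load("recall")
acc = evaluate.load("accuracy")

# 3 classes, 3 out of 5 predictions correct
preds = [0, 1, 2, 2, 1]
refs = [0, 2, 2, 2, 0]

print(prec.compute(predictions=preds, references=refs, average="micro"))  # {'precision': 0.6}
print(rec.compute(predictions=preds, references=refs, average="micro"))   # {'recall': 0.6}
print(acc.compute(predictions=preds, references=refs))                    # {'accuracy': 0.6}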

This implementation will not let you combine accuracy with precision/recall in a multi-class use case, though. If you put accuracy into the combined metric, you cannot pass average to metric.compute (accuracy does not expect that argument), and if you leave average at its default, precision and recall fall back to the binary setting.
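
One workaround (a minimal sketch, assuming you want accuracy next to macro-averaged precision/recall) is to keep accuracy outside the combined metric and merge the two result dicts, so that average only reaches the metrics that accept it:

import evaluate
import numpy as np

clf_metrics = evaluate.combine(["precision", "recall"])
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # precision/recall take average=..., accuracy does not
    results = clf_metrics.compute(predictions=predictions, references=labels, average="macro")
    results.update(accuracy.compute(predictions=predictions, references=labels))
    return results

Trainer then logs these keys as eval_precision, eval_recall, and eval_accuracy at every eval_steps interval.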
