Metric.compute() tediously slow when applied to semantic segmentation

Dear all,

I am trying to build a fine-tuning script for a semantic segmentation problem with the SegFormer model. To do this, I have followed the official tutorial, Semantic segmentation, and the example scripts in this repo. The latter are practically identical to the official tutorial. When running the fine-tuning, I noticed that the evaluation step takes considerably more time than the training step. In particular, the evaluation step becomes extremely slow on the last batch and is basically stuck there for 10 minutes, see below.

As in the tutorial, I am passing the following compute_metrics function to the Trainer.

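    import evaluate
    import torch
    from torch import nn

    # Load the mean IoU metric from the evaluate package
    metric = evaluate.load("mean_iou")

    # Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a
    # predictions and label_ids field) and has to return a dictionary mapping strings to floats.
    # `id2label` is defined elsewhere in the script.
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        # scale the logits to the size of the label
        # Assumption: the arguments of this call are cut off in the post; the ones
        # below follow the comment above and the usual tutorial pattern.
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()
        # Assumption: these arguments are also cut off in the post; the
        # ignore_index value in particular is a guess.
        metrics = metric.compute(
            predictions=pred_labels,
            references=labels,
            num_labels=len(id2label),
            ignore_index=255,
        )
        # add per category metrics as individual key-value pairs
        per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
        per_category_iou = metrics.pop("per_category_iou").tolist()

        metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
        metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})

        return metrics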

The metric.compute() call is the one that takes a long time to finish. Furthermore, I have noticed that encode_nested_example is called continuously, and it seems to contribute the most to the computing time. Being new to Hugging Face, I'm frankly a bit stuck here, and any help to accelerate the evaluation in this case would be greatly appreciated. I am probably doing something wrong; it seems odd to me that the evaluation takes ca. 10x longer than the training stage.

Best regards,



I am facing a similar problem.
I followed the same steps, training on my own data on my own machine.

metric.add_batch takes about a minute per batch.
The batch size is 40 and the image size is 512x512.
I stepped into add_batch.
It calls another function, which then calls encode_nested_example,
which in turn is called recursively several times.
For my 40x512x512 input, encode_nested_example is called again for every 512x512 image, then for every row, and then for every column, to cast each pixel value to int64.
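
To get a feel for the numbers, here is a small sketch that counts how many Python-level calls such a recursive walk makes. `count_recursive_calls` and `expected_calls` are hypothetical stand-ins written for illustration. They are not the actual encode_nested_example from the library and only mimic the nesting pattern described above (the whole batch, then each image, then each row, then each pixel).

```python
import numpy as np


def count_recursive_calls(obj):
    """Count the calls a naive recursive walk makes over a nested list.

    Hypothetical stand-in, not the real encode_nested_example.
    """
    if isinstance(obj, list):
        return 1 + sum(count_recursive_calls(item) for item in obj)
    return 1  # leaf: one call per pixel value


def expected_calls(batch, height, width):
    # One call for the batch, one per image, one per row, one per pixel.
    return 1 + batch + batch * height + batch * height * width


# Small input so the recursive count finishes instantly.
small = np.zeros((2, 8, 8), dtype=np.int64).tolist()
small_calls = count_recursive_calls(small)

# Same formula applied to the 40x512x512 batch from the post.
full_calls = expected_calls(40, 512, 512)
print(small_calls, full_calls)
```

With these assumptions, a single 40x512x512 batch would need roughly 10.5 million Python-level calls, which would be consistent with add_batch taking on the order of a minute.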

It takes a long time!
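
One possible workaround, sketched here under assumptions rather than taken from the posts above, is to build the confusion matrix directly with NumPy so that the label maps never go through a per-element encoding step. `fast_mean_iou` and its parameters are hypothetical names. The result follows the usual per-class definition (intersection over union, averaged over the classes that occur) and is not guaranteed to match the mean_iou metric in every edge case. Predictions are assumed to lie in the range 0 to num_labels - 1.

```python
import numpy as np


def fast_mean_iou(pred_labels, labels, num_labels, ignore_index=255):
    """Mean IoU from a confusion matrix built with np.bincount.

    Hypothetical helper, not part of evaluate or transformers.
    """
    pred = np.asarray(pred_labels).ravel().astype(np.int64)
    ref = np.asarray(labels).ravel().astype(np.int64)

    # Drop pixels marked as "ignore" in the reference labels.
    keep = ref != ignore_index
    pred, ref = pred[keep], ref[keep]

    # Entry [i, j] counts pixels of reference class i predicted as class j.
    confusion = np.bincount(
        ref * num_labels + pred, minlength=num_labels**2
    ).reshape(num_labels, num_labels)

    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection

    # Classes absent from both predictions and references yield NaN
    # and are skipped by nanmean.
    with np.errstate(divide="ignore", invalid="ignore"):
        per_category_iou = intersection / union

    return {
        "mean_iou": float(np.nanmean(per_category_iou)),
        "per_category_iou": per_category_iou,
    }


# Tiny usage example: the last pixel is ignored via ignore_index.
example = fast_mean_iou(
    pred_labels=[0, 1, 1, 1, 2, 0, 2],
    labels=[0, 0, 1, 1, 2, 2, 255],
    num_labels=3,
)
print(example["mean_iou"])
```

Inside compute_metrics, a helper like this could stand in for the metric.compute() call, with pred_labels and labels passed as NumPy arrays. Whether it reproduces the library metric exactly for a given dataset would need to be checked.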