Dear all,

I am trying to build a fine-tuning script for a semantic segmentation problem with the SegFormer model. To do this I have followed the official Semantic segmentation tutorial and the example scripts in this repo: https://github.com/huggingface/transformers/tree/main/examples/pytorch/semantic-segmentation. The latter are practically identical to the official tutorial. When executing the fine-tuning, I have noticed that the evaluation step takes considerably more time than the training step. In particular, evaluation becomes extremely slow on the last batch, where it is basically stuck for about 10 minutes, see below.

As in the tutorial, I am passing the following `compute_metrics` function to the `Trainer`.

```
import evaluate
import torch
from torch import nn

# Load the mean IoU metric from the evaluate package
metric = evaluate.load("mean_iou")

# Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple
# with a predictions and a label_ids field) and has to return a dictionary mapping
# metric names (strings) to float values.
@torch.no_grad()
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits_tensor = torch.from_numpy(logits)
    # Upsample the logits to the spatial size of the labels
    logits_tensor = nn.functional.interpolate(
        logits_tensor,
        size=labels.shape[-2:],
        mode="bilinear",
        align_corners=False,
    ).argmax(dim=1)
    pred_labels = logits_tensor.detach().cpu().numpy()
    metrics = metric.compute(
        predictions=pred_labels,
        references=labels,
        num_labels=len(id2label),
        ignore_index=255,
        reduce_labels=image_processor.do_reduce_labels,
    )
    # Add per-category metrics as individual key-value pairs
    per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
    per_category_iou = metrics.pop("per_category_iou").tolist()
    metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
    metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
    return metrics
```
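For context, this is roughly how the function is wired into the `Trainer` (a sketch only; `model`, `train_ds`, `eval_ds`, and the argument values are placeholders for my actual objects, and the exact `TrainingArguments` fields may differ slightly between `transformers` versions):

```
from transformers import Trainer, TrainingArguments

# Sketch of the Trainer setup; model, train_ds, and eval_ds are placeholders.
training_args = TrainingArguments(
    output_dir="segformer-finetuned",
    evaluation_strategy="epoch",
    per_device_eval_batch_size=8,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
```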

The `metric.compute()` call is the one that takes a long time to finish. Furthermore, I have noticed that `encode_nested_example` is called continuously and seems to contribute the most to the computing time. Being new to Hugging Face, I'm frankly a bit stuck here, and any help to accelerate the evaluation in this case would be greatly appreciated. I am probably doing something wrong; it seems odd to me that evaluation takes roughly 10x longer than training.
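For reference, this is the kind of direct confusion-matrix computation I had expected mean IoU to boil down to (a rough NumPy-only sketch; the `fast_mean_iou` name and its details are mine, not part of the `evaluate` API):

```
import numpy as np

def fast_mean_iou(pred_labels, labels, num_labels, ignore_index=255):
    """Compute mean IoU from a confusion matrix built with np.bincount."""
    preds = np.asarray(pred_labels).ravel()
    refs = np.asarray(labels).ravel()
    # Drop pixels marked with the ignore index
    mask = refs != ignore_index
    preds, refs = preds[mask], refs[mask]
    # Confusion matrix: rows = reference class, columns = predicted class
    cm = np.bincount(
        refs * num_labels + preds, minlength=num_labels**2
    ).reshape(num_labels, num_labels)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)
    # Average only over classes that actually appear
    return {"mean_iou": float(iou[union > 0].mean())}
```

The whole batch is reduced in a single `np.bincount` call over the flattened arrays, so there is no per-example Python loop involved.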

Best regards,

Kirill