How does Segformer handle image size differences?

I have a set of 1024x1024 images and I am trying to fine-tune the segformer-b4-finetuned-cityscapes-1024-1024 pre-trained model for semantic segmentation.

The config for this model (linked above) says that the image size should be 224. If I set ignore_mismatched_sizes=True, I can pass it 1024x1024 images without a problem, and the results seem pretty strong.

I am wondering, though, what is happening behind the scenes? If the model expects a 224 image but receives a 1024 image, how is that handled? Is the image downsampled before being fed to the model? Is it chunked into 224 pixel inputs?

Additionally, the feature extractor for this model is seemingly set to resize to 512.

>>> from transformers import SegformerFeatureExtractor
>>> feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b4-finetuned-cityscapes-1024-1024")
>>> feature_extractor
SegformerFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "SegformerFeatureExtractor",
  "image_mean": [
  "image_std": [
  "reduce_labels": false,
  "resample": 2,
  "size": 512
}


I should actually remove the image_size and downsampling_rates attributes of SegformerConfig, because neither is used by the model (they are leftovers from ViTConfig). The only things that matter are config.patch_sizes (set to 7, 3, 3, 3 for all SegFormer variants), config.strides, and config.hidden_sizes, which together drive the Overlapped Patch Merging process explained in the paper (section 3.1).

This also explains why it works on different image sizes (like 512 or 1024): the patch embedding layers pad the input, so the input size (like 1024) does not have to be divisible by the patch size (like 7). The authors use a padding value of 3 when the patch size is 7, and a value of 1 when the patch size is 3. This makes the model work on any input image.
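To make the padding arithmetic concrete, here is a rough sketch in plain Python (not the library's code). It applies the standard convolution output-size formula stage by stage. The patch sizes and paddings are the ones mentioned above; the strides (4, 2, 2, 2) are an assumption based on the default config, and the helper names are made up for illustration.

```python
# Sketch of the spatial-size arithmetic in SegFormer's overlapped patch embeddings.
# Patch sizes (7, 3, 3, 3) and paddings of 3 and 1 come from the explanation above;
# strides (4, 2, 2, 2) are ASSUMED to match the default config.
PATCH_SIZES = (7, 3, 3, 3)
STRIDES = (4, 2, 2, 2)


def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided, padded convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_sizes(height, width):
    """Return the (height, width) of the feature map after each encoder stage."""
    sizes = []
    for kernel, stride in zip(PATCH_SIZES, STRIDES):
        padding = kernel // 2  # 3 for a 7x7 patch, 1 for a 3x3 patch
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        sizes.append((height, width))
    return sizes


print(stage_sizes(1024, 1024))  # [(256, 256), (128, 128), (64, 64), (32, 32)]
print(stage_sizes(189, 203))    # [(48, 51), (24, 26), (12, 13), (6, 7)]
```

Under these assumptions, every stage produces a valid feature map even for odd sizes like (189, 203), and the first stage ends up at roughly a quarter of the input resolution.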

For instance, you can forward an image of size (189, 203) through the model and it will also work. Also from the paper:

"As a result, our encoder can easily adapt to arbitrary test resolutions without impacting the performance."

Hope that clarifies your question!

Thanks, @nielsr! That is very clarifying. One follow-up question:

When I pass in my 1024x1024 images following the approach in the fine-tuning example notebook, the shape of my pixel_values is torch.Size([2, 3, 1024, 1024]) (batch size 2), but the shape of outputs.logits is torch.Size([2, 4, 256, 256]). Will the model always return logits in 256x256 shape or is that customizable?

Edit: I see now in Section 3.2 of the original paper that the segmentation mask is H/4 x W/4 x Ncls, which I suppose suggests this is hard-wired into the model architecture.

Yes, SegFormer outputs logits of shape (batch_size, num_labels, height / 4, width / 4). So you need to upsample them to the original size of the image using torch.nn.functional.interpolate.
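A minimal sketch of that upsampling step, using a random tensor as a stand-in for outputs.logits with the shapes from the question. The bilinear mode and align_corners=False are assumptions here (a common choice for segmentation logits), not something the model requires.

```python
import torch
import torch.nn.functional as F

# Stand-in for `outputs.logits`: random values with the shape reported above
# (batch_size=2, num_labels=4, height/4=256, width/4=256).
logits = torch.randn(2, 4, 256, 256)

# Upsample to the original 1024x1024 input resolution.
# Bilinear mode with align_corners=False is an ASSUMED, commonly used setting.
upsampled = F.interpolate(
    logits, size=(1024, 1024), mode="bilinear", align_corners=False
)

# Per-pixel class prediction: index of the highest-scoring label.
predicted = upsampled.argmax(dim=1)

print(upsampled.shape)  # torch.Size([2, 4, 1024, 1024])
print(predicted.shape)  # torch.Size([2, 1024, 1024])
```

In practice you would pass the spatial size of your labels (or of the original image) as the size argument, so the upsampled logits line up with the ground-truth mask.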

Have you seen our new blog post btw? Fine-Tune a Semantic Segmentation Model with a Custom Dataset

It includes an up-to-date version of my notebook, and some nice utilities such as the mIoU metric, which is now available in the Datasets library.

Thanks again, @nielsr. That new blog post looks excellent!