How does Segformer handle image size differences?

I have a set of 1024x1024 images and I am trying to fine-tune the segformer-b4-finetuned-cityscapes-1024-1024 pre-trained model for semantic segmentation.

The config for this model (linked above) says that the image size should be 224. If I set ignore_mismatched_sizes=True, I can pass it 1024x1024 images without a problem with seemingly pretty strong results.

I am wondering, though, what is happening behind the scenes? If the model expects a 224 image but receives a 1024 image, how is that handled? Is the image downsampled before being fed to the model? Is it chunked into 224 pixel inputs?

Additionally, the feature extractor for this model is seemingly set to resize to 512.

>>> feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b4-finetuned-cityscapes-1024-1024")
>>> feature_extractor
SegformerFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "SegformerFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "reduce_labels": false,
  "resample": 2,
  "size": 512
}

Hi,

I should actually remove the image_size and downsampling_rates attributes of SegformerConfig, because neither is used by the model (they are leftovers from ViTConfig). The only things that matter are config.patch_sizes (set to (7, 3, 3, 3) for all SegFormer variants), config.strides, and config.hidden_sizes, which define the Overlapped Patch Merging process explained in the paper (section 3.1).

This also explains why it works on different image sizes (like 512 or 1024): the patch embedding layers pad the input if the input size (like 1024) isn't divisible by the patch size (like 7). The authors use a padding of 3 when the patch size is 7, and a padding of 1 when the patch size is 3. This makes the model work on any input image.
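The padding behavior above can be sketched with a single strided convolution, which is essentially what a SegFormer patch embedding layer is. This is a minimal illustration, not the actual model code: the hidden size of 64 and the single-layer setup are assumptions for demonstration purposes only.

```python
import torch
import torch.nn as nn

# Sketch of the first-stage patch embedding: patch size 7, stride 4, padding 3.
# Because of the padding, the spatial output is ceil(H / stride) x ceil(W / stride)
# for any input size, so no particular input resolution is required.
patch_embed = nn.Conv2d(in_channels=3, out_channels=64,
                        kernel_size=7, stride=4, padding=3)

square = torch.randn(1, 3, 1024, 1024)
odd = torch.randn(1, 3, 189, 203)  # arbitrary, non-divisible size

print(patch_embed(square).shape)  # torch.Size([1, 64, 256, 256])
print(patch_embed(odd).shape)     # torch.Size([1, 64, 48, 51])
```

Both inputs go through without error, which is why the encoder adapts to arbitrary resolutions.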

This also makes the model very flexible: it works on any input size. You can, for instance, forward an image of size (189, 203) through the model and it will also work. From the paper:

As a result, our encoder can easily adapt to arbitrary test resolutions without impacting the performance.

Hope that clarifies your question!


Thanks, @nielsr! That is very clarifying. One follow-up question:

When I pass in my 1024x1024 images following the approach in the fine-tuning example notebook, the shape of my pixel_values is torch.Size([2, 3, 1024, 1024]) (batch size 2), but the shape of outputs.logits is torch.Size([2, 4, 256, 256]). Will the model always return logits in 256x256 shape or is that customizable?

Edit: I see now in Section 3.2 of the original paper that the segmentation mask is H/4 x W/4 x Ncls, which I suppose suggests this is hard-wired into the model architecture.

Yes, SegFormer outputs logits of shape (batch_size, num_labels, height / 4, width / 4). So you need to upsample them to the original size of the image using torch.nn.functional.interpolate.
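The upsampling step mentioned above can be done like this (a minimal sketch with random logits standing in for `outputs.logits`; the shapes match the ones reported in the question):

```python
import torch
import torch.nn.functional as F

# Stand-in for outputs.logits: (batch_size, num_labels, H/4, W/4)
logits = torch.randn(2, 4, 256, 256)

# Upsample back to the original image resolution with bilinear interpolation.
upsampled = F.interpolate(logits, size=(1024, 1024),
                          mode="bilinear", align_corners=False)

# Per-pixel class predictions: (batch_size, H, W)
pred = upsampled.argmax(dim=1)
print(pred.shape)  # torch.Size([2, 1024, 1024])
```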

Have you seen our new blog post btw? Fine-Tune a Semantic Segmentation Model with a Custom Dataset

It includes an up-to-date version of my notebook, as well as some nice utilities such as the mIoU metric, which is now available in the Datasets library.


Thanks again, @nielsr. That new blog post looks excellent!