I have a set of 1024x1024 images and I am trying to fine-tune the segformer-b4-finetuned-cityscapes-1024-1024 pre-trained model for semantic segmentation.
The config for this model (linked above) says that the image size should be 224. If I set
ignore_mismatched_sizes=True, I can pass it 1024x1024 images without a problem, with seemingly pretty strong results.
I am wondering, though, what is happening behind the scenes? If the model expects a 224 image but receives a 1024 image, how is that handled? Is the image downsampled before being fed to the model? Is it chunked into 224 pixel inputs?
Additionally, the feature extractor for this model is seemingly set to resize to 512.
>>> feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b4-finetuned-cityscapes-1024-1024")
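For reference, the resize-then-normalize the extractor applies is roughly equivalent to this pure-PyTorch sketch (the ImageNet mean/std values below are the common defaults and are an assumption on my part; the checkpoint's extractor config may specify different statistics):

```python
import torch
import torch.nn.functional as F

# A fake 1024x1024 RGB image, uint8 HWC -> float CHW in [0, 1].
img = torch.randint(0, 256, (1024, 1024, 3), dtype=torch.uint8)
x = img.permute(2, 0, 1).float().div(255.0).unsqueeze(0)  # (1, 3, 1024, 1024)

# Step 1: resize to the extractor's default size of 512.
x = F.interpolate(x, size=(512, 512), mode="bilinear", align_corners=False)

# Step 2: normalize per channel (assumed ImageNet mean/std).
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
pixel_values = (x - mean) / std

print(pixel_values.shape)  # torch.Size([1, 3, 512, 512])
```

So with default settings, 1024x1024 images would be shrunk to 512x512 before ever reaching the model.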
I should actually remove the image_size and downsampling_rates attributes of SegformerConfig, since both aren't used by the model (they are leftovers from ViTConfig). The only things that matter are config.patch_sizes (which is set to [7, 3, 3, 3] for all SegFormer variants) and config.hidden_sizes, which are used to perform the Overlapped Patch Merging process explained in the paper (section 3.1).
This also explains why the model works on different image sizes (like 512 or 1024): the patch embedding layers will pad the input if the input size (like 1024) isn't divisible by the patch size (like 7). The authors use a padding value of 3 when the patch size is 7, and a value of 1 when the patch size is 3. This makes the model work on any input image.
This also makes the model very flexible: you can, for instance, forward an image of size (189, 203) through the model and it will also work. Also from the paper:
As a result, our encoder can easily adapt to arbitrary test resolutions without impacting the performance.
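To make the padding arithmetic concrete, here is a sketch of the first overlapped patch embedding as a plain strided convolution, using the kernel size 7 and padding 3 mentioned above together with stride 4 (the stride and the 64 output channels are illustrative choices, not values read from the config). Any spatial size goes in, and a roughly H/4 x W/4 grid comes out:

```python
import torch
import torch.nn as nn

# First-stage overlapped patch embedding as a 7x7 conv with padding 3,
# per the values above; stride 4 and 64 channels are illustrative choices.
patch_embed = nn.Conv2d(in_channels=3, out_channels=64,
                        kernel_size=7, stride=4, padding=3)

for h, w in [(1024, 1024), (512, 512), (189, 203)]:
    out = patch_embed(torch.randn(1, 3, h, w))
    # Output spatial size is floor((dim + 2*3 - 7) / 4) + 1, i.e. ~dim/4.
    print((h, w), "->", tuple(out.shape[2:]))
```

Even the awkward (189, 203) input produces a valid (48, 51) feature map, which is why no fixed image size is baked into the encoder.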
Hope that clarifies your question!
Thanks, @nielsr! That is very clarifying. One follow-up question:
When I pass in my 1024x1024 images following the approach in the fine-tuning example notebook, the shape of my pixel_values is torch.Size([2, 3, 1024, 1024]) (batch size 2), but the shape of the logits is torch.Size([2, 4, 256, 256]). Will the model always return logits in 256x256 shape, or is that customizable?
Edit: I see now in Section 3.2 of the original paper that the segmentation mask is
H/4 x W/4 x Ncls, which I suppose suggests this is hard-wired into the model architecture.
Yes, SegFormer outputs logits of shape (batch_size, num_labels, height / 4, width / 4). So you need to upsample them to the original size of the image using torch.nn.functional.interpolate.
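For example, with dummy logits shaped like the ones above, the upsampling back to 1024x1024 looks like this:

```python
import torch
import torch.nn.functional as F

# Dummy logits in SegFormer's output shape: (batch, num_labels, H/4, W/4).
logits = torch.randn(2, 4, 256, 256)

# Upsample back to the original 1024x1024 resolution.
upsampled = F.interpolate(logits, size=(1024, 1024),
                          mode="bilinear", align_corners=False)
print(upsampled.shape)  # torch.Size([2, 4, 1024, 1024])

# Per-pixel class prediction at full resolution.
pred = upsampled.argmax(dim=1)
print(pred.shape)  # torch.Size([2, 1024, 1024])
```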
Have you seen our new blog post btw? Fine-Tune a Semantic Segmentation Model with a Custom Dataset
It includes an up-to-date version of my notebook, and some nice utilities such as the mIoU metric, which is now available in the Datasets library.
Thanks again, @nielsr. That new blog post looks excellent!