I have a set of 1024x1024 images and I am trying to fine-tune the segformer-b4-finetuned-cityscapes-1024-1024 pre-trained model for semantic segmentation.
The config for this model (linked above) says that the image size should be 224. If I set ignore_mismatched_sizes=True, I can pass it 1024x1024 images without any errors, and the results seem fairly strong.
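For reference, this is roughly how I'm loading the model; the two-class label mapping below is just a placeholder standing in for my actual dataset's labels:

>>> from transformers import SegformerForSemanticSegmentation
>>> # placeholder label mapping; my real dataset has its own classes
>>> id2label = {0: "background", 1: "foreground"}
>>> label2id = {v: k for k, v in id2label.items()}
>>> model = SegformerForSemanticSegmentation.from_pretrained(
...     "nvidia/segformer-b4-finetuned-cityscapes-1024-1024",
...     num_labels=len(id2label),
...     id2label=id2label,
...     label2id=label2id,
...     ignore_mismatched_sizes=True,  # decode head is re-initialized for my labels
... )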
I am wondering, though: what is happening behind the scenes? If the model expects a 224x224 image but receives a 1024x1024 image, how is that handled? Is the image downsampled before being fed to the model, or is it chunked into 224x224 patches?
Additionally, the feature extractor for this model appears to be configured to resize inputs to 512:
>>> feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b4-finetuned-cityscapes-1024-1024")
>>> feature_extractor
SegformerFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "SegformerFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "reduce_labels": false,
  "resample": 2,
  "size": 512
}
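If I want to keep the full 1024x1024 resolution, is overriding size at load time the intended fix? I have been assuming that kwargs passed to from_pretrained override the values in the config above, e.g.:

>>> feature_extractor = SegformerFeatureExtractor.from_pretrained(
...     "nvidia/segformer-b4-finetuned-cityscapes-1024-1024",
...     size=1024,  # override the default size of 512 shown above
... )
>>> feature_extractor.size
1024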