Image encoder inference with variable size images

I am training/predicting using a model that sees variable-size images within the same batch.
One batch can include two images, e.g. one of size (64, 64) and another of size (64, 128).

After running them through the processor, I pad all images to (64, 128) with 0 values.

Now, if I encode the first image unpadded vs. padded, I get different embeddings even for the first patch of the image; this is due to self-attention across the patches.
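
For example, a minimal sketch of this comparison (random pixel values and the SwinV2 checkpoint I mention below; the real preprocessing is omitted):

```python
import torch
from transformers import Swinv2Model

model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256").eval()

image = torch.rand(1, 3, 64, 64)                  # the (64, 64) image
padded = torch.nn.functional.pad(image, (0, 64))  # right-padded with zeros to (64, 128)

with torch.no_grad():
    out_unpadded = model(image).last_hidden_state
    out_padded = model(padded).last_hidden_state

# The first output token differs even though its underlying pixels are identical
print(torch.allclose(out_unpadded[0, 0], out_padded[0, 0]))  # False in my runs
```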

I have assembled a bool_masked_pos according to the documentation, which indicates which patches should be masked (based on model.config.patch_size).
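
Roughly like this (a sketch; build_bool_masked_pos is a hypothetical helper, and True marks a patch as masked, following the masked-image-modeling docs):

```python
import torch

def build_bool_masked_pos(orig_size, padded_size, patch_size):
    """True for patches that lie entirely in the zero-padded area."""
    grid_h, grid_w = padded_size[0] // patch_size, padded_size[1] // patch_size
    real_h = -(-orig_size[0] // patch_size)  # ceil division
    real_w = -(-orig_size[1] // patch_size)

    mask = torch.ones(grid_h, grid_w, dtype=torch.bool)
    mask[:real_h, :real_w] = False            # patches with real pixels stay unmasked
    return mask.flatten().unsqueeze(0)        # shape (1, num_patches)

# e.g. the (64, 64) image padded to (64, 128), with patch_size = model.config.patch_size = 4
bool_masked_pos = build_bool_masked_pos((64, 64), (64, 128), patch_size=4)
```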

I am now running a microsoft/swinv2-tiny-patch4-window16-256 model, and while it does accept arbitrarily sized images, it does not accept bool_masked_pos.

Is there a generic way to train/predict on different-size images correctly, in the same batch?

I do understand that for the Swin transformer I am using, for example, there are variable patch sizes. Maybe every patch that is fully 0 should be masked.
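
A sketch of that idea (fully_zero_patch_mask is a hypothetical helper; it assumes the padding value is exactly 0, which a normalized real pixel could in principle also be):

```python
import torch
import torch.nn.functional as F

def fully_zero_patch_mask(pixel_values, patch_size):
    """True for patches whose pixels are all exactly zero."""
    # Max absolute value inside each non-overlapping patch, reduced over channels first
    per_patch_max = F.max_pool2d(pixel_values.abs().amax(dim=1, keepdim=True),
                                 kernel_size=patch_size, stride=patch_size)
    return (per_patch_max == 0).flatten(1)  # shape (batch, num_patches)

# a (64, 64) image right-padded to (64, 128): half of the 512 patches come out True
image = F.pad(torch.rand(1, 3, 64, 64), (0, 64))
mask = fully_zero_patch_mask(image, patch_size=4)
```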

Is there a generic way to train/predict on different-size images correctly, in the same batch?

For example, I think the pixel_mask in DETR corresponds to this, but I don’t think it can be used with SwinV2.
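
For reference, a rough sketch of what that looks like (the checkpoint name is just an example; I disable resizing and rescaling only to keep the toy tensors readable):

```python
import torch
from transformers import DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

# Two differently sized images; the processor pads them to a common size
images = [torch.rand(3, 64, 64), torch.rand(3, 64, 128)]
inputs = processor(images=images, do_resize=False, do_rescale=False, return_tensors="pt")

print(inputs["pixel_values"].shape)  # both images padded to (64, 128)
print(inputs["pixel_mask"].shape)    # 1 for real pixels, 0 for padded pixels
```

DetrModel then accepts this pixel_mask alongside pixel_values.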

Thanks! So the behavior of DETR and NaViT is different, DINOv3 is also different, and CLIP / SwinV2 does not seem to support it in any way?

It seems odd that there is a clear way to do this for text sequences (attention_mask) but sometimes not at all for image models.

So the behavior of DETR and NaViT is different, DINOv3 is also different,

Yeah.

and CLIP / SwinV2 does not seem to support it in any way?

I think so, to a certain extent. It seems possible to achieve this by customizing the model class code or by preprocessing the input in advance.
However, if you are training for multiple resolutions, I think it would be easier to use a model that supports this from the start…:sweat_smile:
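
As a sketch of the preprocessing route (just one option, not SwinV2-specific support): resize every image to one common size instead of zero-padding, so that no patch is pure padding:

```python
import torch
import torch.nn.functional as F
from transformers import Swinv2Model

model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256").eval()

def encode_resized(images, target_size=(64, 128)):
    # Resize (rather than pad) so every patch the model attends to holds real content
    resized = [F.interpolate(img.unsqueeze(0), size=target_size,
                             mode="bilinear", align_corners=False) for img in images]
    batch = torch.cat(resized)
    with torch.no_grad():
        return model(batch).last_hidden_state

images = [torch.rand(3, 64, 64), torch.rand(3, 64, 128)]
embeddings = encode_resized(images)  # shape (2, num_patches, hidden_size)
```

The trade-off is that the aspect ratio gets distorted, but the batch then needs no padding mask at all.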

It seems odd that there is a clear way to do this for text sequences (attention_mask) but sometimes not at all for image models.

I wonder if this is because ViT does not support this by default.

Actually, when we pad images with zeros, the model “sees” those zero areas, and this changes the results even for the real parts of the image, a bit like empty seats in a meeting that still affect the discussion.

You need to tell the model to ignore the padded areas. Here’s a simple approach:

Simple solution - just add this wrapper around your existing Swin model:

```python
import torch
import torch.nn as nn
from transformers import Swinv2Model


class FixedSwinModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Your existing model
        self.swin = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256")
        self.patch_size = 4  # patch size of the first Swin stage

    def forward(self, images, original_sizes):
        # Step 1: Run the model normally
        outputs = self.swin(images)

        # Step 2: Find and zero out the padded regions
        embeddings = outputs.last_hidden_state
        masked_embeddings = self.remove_padding_effect(embeddings, images, original_sizes)

        # Step 3: Return the cleaned results
        outputs.last_hidden_state = masked_embeddings
        return outputs

    def remove_padding_effect(self, embeddings, images, original_sizes):
        """Zero out output embeddings attributed to zero-padded patches.

        Assumes the valid patches come first in the token sequence and only
        cleans the output after the forward pass.
        """
        batch_size = images.shape[0]

        for i in range(batch_size):
            orig_h, orig_w = original_sizes[i]

            # How many patches cover the real (unpadded) image
            real_patches_h = (orig_h + self.patch_size - 1) // self.patch_size
            real_patches_w = (orig_w + self.patch_size - 1) // self.patch_size
            valid_patches = real_patches_h * real_patches_w

            # Zero out embeddings from padded patches
            if embeddings.dim() == 3:  # [batch, patches, features]
                embeddings[i, valid_patches:] = 0

        return embeddings
```

How to use it (replace your current model):

```python
model = FixedSwinModel()
```

Your existing batch processing with one small change:

```python
def process_batch(image_list):
    # Store the original sizes (add this line)
    original_sizes = [(img.shape[1], img.shape[2]) for img in image_list]

    # Your existing padding code
    max_h = max(img.shape[1] for img in image_list)
    max_w = max(img.shape[2] for img in image_list)

    padded_images = []
    for img in image_list:
        padded = torch.nn.functional.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1]))
        padded_images.append(padded)

    batch = torch.stack(padded_images)

    # Updated model call (pass the original sizes)
    outputs = model(batch, original_sizes)
    return outputs
```

Don’t worry, the fix is actually quite straightforward.

You keep using the same Swin model and padding approach; you just add a wrapper that removes the influence of the padded areas.

You only need to track the original image sizes and pass them to the model; this ensures consistent embeddings whether images are padded or not.

Good luck!

Thanks @Angelina067, that seems reasonable.

I tested it by running two images, padded, separately through the model and looking at the embedding of the first patch, output.hidden_states[-1][0, 0] (the first 0 is the batch, the second 0 is the patch),

and they are different, despite both images being right-padded, so this patch should be identical.

I believe this is due to the self-attention mechanism in Swin: the first patch attends to the padding patches as well.
