Image encoder inference with variable size images

I am training/predicting using a model that sees variable-size images within the same batch.
One batch can include two images, e.g. one of size (64, 64) and another of size (64, 128).

After running them through the processor, I pad all images to (64, 128) with 0 values.

Now, if I encode the first image unpadded vs. padded, I get different embeddings even for the first patch of the image; this is due to self-attention across the patches.
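
For example, a minimal sketch of this comparison (random pixel values and the SwinV2 checkpoint I mention below; the real preprocessing is omitted):

```python
import torch
from transformers import Swinv2Model

model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256").eval()

image = torch.rand(1, 3, 64, 64)                  # the (64, 64) image
padded = torch.nn.functional.pad(image, (0, 64))  # right-padded with zeros to (64, 128)

with torch.no_grad():
    out_unpadded = model(image).last_hidden_state
    out_padded = model(padded).last_hidden_state

# The first output token differs even though its underlying pixels are identical
print(torch.allclose(out_unpadded[0, 0], out_padded[0, 0]))  # False in my runs
```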

I have assembled a bool_masked_pos according to the documentation, which indicates which patches should be masked (based on model.config.patch_size).
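
Roughly like this (a sketch; build_bool_masked_pos is a hypothetical helper, and True marks a patch as masked, following the masked-image-modeling docs):

```python
import torch

def build_bool_masked_pos(orig_size, padded_size, patch_size):
    """True for patches that lie entirely in the zero-padded area."""
    grid_h, grid_w = padded_size[0] // patch_size, padded_size[1] // patch_size
    real_h = -(-orig_size[0] // patch_size)  # ceil division
    real_w = -(-orig_size[1] // patch_size)

    mask = torch.ones(grid_h, grid_w, dtype=torch.bool)
    mask[:real_h, :real_w] = False            # patches with real pixels stay unmasked
    return mask.flatten().unsqueeze(0)        # shape (1, num_patches)

# e.g. the (64, 64) image padded to (64, 128), with patch_size = model.config.patch_size = 4
bool_masked_pos = build_bool_masked_pos((64, 64), (64, 128), patch_size=4)
```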

I am now running a microsoft/swinv2-tiny-patch4-window16-256 model, and while it does accept arbitrarily sized images, it does not accept bool_masked_pos.

Is there a generic way to train/predict on different-size images correctly, in the same batch?

I do understand that for the Swin transformer I am using, for example, there are variable patch sizes. Maybe every patch that is fully 0 should be masked.
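
A sketch of that idea (fully_zero_patch_mask is a hypothetical helper; it assumes the padding value is exactly 0, which a normalized real pixel could in principle also be):

```python
import torch
import torch.nn.functional as F

def fully_zero_patch_mask(pixel_values, patch_size):
    """True for patches whose pixels are all exactly zero."""
    # Max absolute value inside each non-overlapping patch, reduced over channels first
    per_patch_max = F.max_pool2d(pixel_values.abs().amax(dim=1, keepdim=True),
                                 kernel_size=patch_size, stride=patch_size)
    return (per_patch_max == 0).flatten(1)  # shape (batch, num_patches)

# a (64, 64) image right-padded to (64, 128): half of the 512 patches come out True
image = F.pad(torch.rand(1, 3, 64, 64), (0, 64))
mask = fully_zero_patch_mask(image, patch_size=4)
```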

Is there a generic way to train/predict on different-size images correctly, in the same batch?

For example, I think the pixel_mask in DETR corresponds to this, but I don’t think it can be used with SwinV2.
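
For reference, a rough sketch of what that looks like (the checkpoint name is just an example; I disable resizing and rescaling only to keep the toy tensors readable):

```python
import torch
from transformers import DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

# Two differently sized images; the processor pads them to a common size
images = [torch.rand(3, 64, 64), torch.rand(3, 64, 128)]
inputs = processor(images=images, do_resize=False, do_rescale=False, return_tensors="pt")

print(inputs["pixel_values"].shape)  # both images padded to (64, 128)
print(inputs["pixel_mask"].shape)    # 1 for real pixels, 0 for padded pixels
```

DetrModel then accepts this pixel_mask alongside pixel_values.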

Thanks! So the behavior of DETR and NaViT is different, DINOv3 is also different, and CLIP / SwinV2 does not seem to support it in any way?

It seems odd that there is a clear way to do this for text sequences (attention_mask) but sometimes not at all for image models.

So the behavior of DETR and NaViT is different, DINOv3 is also different,

Yeah.

and CLIP / SwinV2 does not seem to support it in any way?

I think so, to a certain extent. It seems possible to achieve this by customizing the model class code or by preprocessing the input in advance.
However, if you are training for multiple resolutions, I think it would be easier to use a model that supports this from the start…:sweat_smile:
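
As a sketch of the preprocessing route (just one option, not SwinV2-specific support): resize every image to one common size instead of zero-padding, so that no patch is pure padding:

```python
import torch
import torch.nn.functional as F
from transformers import Swinv2Model

model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256").eval()

def encode_resized(images, target_size=(64, 128)):
    # Resize (rather than pad) so every patch the model attends to holds real content
    resized = [F.interpolate(img.unsqueeze(0), size=target_size,
                             mode="bilinear", align_corners=False) for img in images]
    batch = torch.cat(resized)
    with torch.no_grad():
        return model(batch).last_hidden_state

images = [torch.rand(3, 64, 64), torch.rand(3, 64, 128)]
embeddings = encode_resized(images)  # shape (2, num_patches, hidden_size)
```

The trade-off is that the aspect ratio gets distorted, but the batch then needs no padding mask at all.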

It seems odd that there is a clear way to do this for text sequences (attention_mask) but sometimes not at all for image models.

I wonder if this is because ViT does not support this by default.

Actually, when we pad images with zeros, the model “sees” those zero areas, and this changes the results even for the real parts of the image, a bit like empty seats in a meeting that still affect the discussion.

You need to tell the model to ignore the padded areas. Here’s a simple approach:

Simple solution - just add this wrapper around your existing Swin model:

```python
import torch
import torch.nn as nn
from transformers import Swinv2Model


class FixedSwinModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Your existing model
        self.swin = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256")
        self.patch_size = 4  # patch size of the first Swin stage

    def forward(self, images, original_sizes):
        # Step 1: Run the model normally
        outputs = self.swin(images)

        # Step 2: Find and zero out the padded regions
        embeddings = outputs.last_hidden_state
        masked_embeddings = self.remove_padding_effect(embeddings, images, original_sizes)

        # Step 3: Return the cleaned results
        outputs.last_hidden_state = masked_embeddings
        return outputs

    def remove_padding_effect(self, embeddings, images, original_sizes):
        """Zero out output embeddings attributed to zero-padded patches.

        Assumes the valid patches come first in the token sequence and only
        cleans the output after the forward pass.
        """
        batch_size = images.shape[0]

        for i in range(batch_size):
            orig_h, orig_w = original_sizes[i]

            # How many patches cover the real (unpadded) image
            real_patches_h = (orig_h + self.patch_size - 1) // self.patch_size
            real_patches_w = (orig_w + self.patch_size - 1) // self.patch_size
            valid_patches = real_patches_h * real_patches_w

            # Zero out embeddings from padded patches
            if embeddings.dim() == 3:  # [batch, patches, features]
                embeddings[i, valid_patches:] = 0

        return embeddings
```

How to use it (replace your current model):

```python
model = FixedSwinModel()
```

Your existing batch processing with one small change:

```python
def process_batch(image_list):
    # Store the original sizes (add this line)
    original_sizes = [(img.shape[1], img.shape[2]) for img in image_list]

    # Your existing padding code
    max_h = max(img.shape[1] for img in image_list)
    max_w = max(img.shape[2] for img in image_list)

    padded_images = []
    for img in image_list:
        padded = torch.nn.functional.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1]))
        padded_images.append(padded)

    batch = torch.stack(padded_images)

    # Updated model call (pass the original sizes)
    outputs = model(batch, original_sizes)
    return outputs
```

Don’t worry, the fix is actually quite straightforward.

You keep using the same Swin model and padding approach; you just add a wrapper that removes the influence of the padded areas.

You only need to track the original image sizes and pass them to the model; this ensures consistent embeddings whether images are padded or not.

Good luck!

Thanks @Angelina067, that seems reasonable.

I tested it by running two images, padded, separately through the model and looking at the embedding of the first patch, output.hidden_states[-1][0, 0] (the first 0 is the batch, the second 0 is the patch),

and they are different, despite both images being right-padded, so this patch should be identical.

I believe this is due to the self-attention mechanism in Swin: the first patch attends to the padding patches as well.
