Hi everyone.
I have an image classification dataset consisting of non-square images, each with a different size.
When training CNNs, I rescaled them so the longer side was 224 and zero-padded the shorter side to make them square.
Then I switched to ViT and found that zero padding drastically hurts classification performance, since many patches contain only zeros.
Random cropping and forced rescaling to a square don't work either, because it is important to include the whole object in the image and preserve the width/height ratio.
What I want to do instead is feed inputs of varying sizes, rescaled to 224 on the longer side and X on the shorter side. I know that tensors in the same batch must have the same shape, so assume that is handled at collate time (size (BatchSize, 3, 224, 160), for example).
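To make the resizing rule concrete, here is a minimal pure-Python sketch of the size computation I mean (the function name `target_size` is just for illustration; the actual resize would be done by the image pipeline):

```python
def target_size(h, w, long_side=224):
    """Scale (h, w) so the longer side becomes long_side, preserving aspect ratio."""
    scale = long_side / max(h, w)
    return round(h * scale), round(w * scale)

print(target_size(640, 480))  # (224, 168)
print(target_size(300, 500))  # (134, 224)
```

At collate time, all images in a batch would then need to share the same shorter-side length X (e.g. by bucketing images of similar aspect ratio together).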
I loaded a ViTModel and fed it inputs of different sizes to see what the outputs look like. With a 112x112 input there are 49 patch tokens + 1 [CLS] = 50 tokens. But when I shrink one dimension to 110, I lose 7 patches.
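I think the token counts come from simple patch arithmetic, assuming the standard patch size of 16: pixels that don't fill a complete patch are dropped by the patch-embedding conv. A quick check:

```python
def num_tokens(h, w, patch=16):
    """Token count for a ViT with non-overlapping patches of size `patch`.

    Leftover pixels that don't fill a full patch are discarded by the
    conv stem; +1 accounts for the [CLS] token.
    """
    return (h // patch) * (w // patch) + 1

print(num_tokens(112, 112))  # 7*7 + 1 = 50
print(num_tokens(112, 110))  # 7*6 + 1 = 43, i.e. 7 patch tokens fewer
print(num_tokens(224, 160))  # 14*10 + 1 = 141
```

So going from 112 to 110 drops a whole column of 7 patches (110 // 16 = 6), which matches the 7 tokens I lost.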
I have no idea how the positional-encoding interpolation is done, or whether it is right to use the interpolate_pos_encoding=True parameter like this.
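My rough understanding of what the interpolation does, sketched in pure Python with bilinear instead of the library's actual interpolation mode (this is a conceptual toy, not the transformers implementation; `interpolate_grid` is a hypothetical helper, and I show one embedding dimension as a 2D grid of scalars):

```python
def interpolate_grid(grid, new_h, new_w):
    """Bilinearly resample a 2D grid (list of lists of floats) to (new_h, new_w)."""
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Map each output coordinate back into the old grid.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y); y1 = min(y0 + 1, old_h - 1); wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x); x1 = min(x0 + 1, old_w - 1); wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bot = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out

# The pretrained model stores patch position embeddings on a 14x14 grid
# (224 / 16 = 14). For a 224x160 input the patch grid is 14x10, so each
# embedding dimension's 14x14 grid is resampled to 14x10; the [CLS]
# token's embedding is left untouched.
grid = [[float(r * 14 + c) for c in range(14)] for r in range(14)]
resampled = interpolate_grid(grid, 14, 10)
print(len(resampled), len(resampled[0]))  # 14 10
```

If that mental model is right, interpolate_pos_encoding=True should at least be shape-correct for my (224, 160) inputs, but I don't know whether the resampled embeddings stay meaningful without fine-tuning.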
My question is: does it make sense to train with batches of varying non-square shapes like this? What do you suggest?
Thanks.