Proper way to handle non-square images with CLIP?

Hi everyone. I’m wondering about the proper way to handle non-square images with CLIP.

Here are the ways I can think of:

  1. resize the image so that the shortest edge (height or width) is 224 pixels long (or 336 for openai/clip-vit-large-patch14-336) and provide this non-square image to CLIP’s vision encoder ⟵ This assumes that CLIP’s vision model can interpolate the pre-trained position encodings.

  2. add black margins to the image so that it becomes a square, resize it to 224x224 and provide this square image to CLIP’s vision encoder (see the sketch right after this list) ⟵ this leads to a lower resolution than 1) :frowning:

  3. directly resize the image to 224x224 and provide this square image to CLIP’s vision encoder ⟵ this changes the aspect ratio of the image :frowning:

  4. resize the image so that the shortest edge (height or width) is 224 pixels long, center-crop the image and provide this square image to CLIP’s vision encoder ⟵ this ignores some parts of the image :frowning:
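For reference, here is one way option 2 could look using PIL's ImageOps.pad before handing the square image to the processor. This is only a minimal sketch: the file name is a placeholder, and the black margins are exactly the padding described above.

from PIL import Image, ImageOps
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg").convert("RGB")  # placeholder: any non-square image

# Pad with black margins into a square of side max(width, height),
# keeping the aspect ratio of the original content.
side = max(image.size)
square = ImageOps.pad(image, (side, side), color=(0, 0, 0))

# The processor's default resize + center crop then simply scales the square down to 224x224.
inputs = processor(images=square, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

image_features = model.get_image_features(**inputs)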

I understand that 4) is what happens by default (the CLIP processor has both do_resize and do_center_crop set to True by default).

I tried 1), but CLIPModel doesn’t have an interpolate_pos_encoding flag like ViTModel does, even though CLIP’s vision encoder is a ViT. More precisely, if I do this:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg").convert("RGB")  # placeholder: any non-square image, here wider than tall

# Resize the shortest edge to 224 but skip the center crop, so the
# preprocessed image keeps its original aspect ratio.
inputs = processor(
    images=image,
    padding=True,
    do_center_crop=False,
    do_resize=True,
    return_tensors="pt"
)

print(inputs["pixel_values"].shape)
# torch.Size([1, 3, 224, 336])

output = model(**inputs)

… I get the following error message, which suggests that CLIPModel only accepts square images:

RuntimeError: The size of tensor a (295) must match the size of tensor b (197) at non-singleton dimension 1

(this is consistent with CLIPModel not being able to interpolate positional encodings, because 295 = (224/16) × (336/16) + 1 while 197 = (224/16)² + 1)
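For reference, a quick check of where those two sequence lengths come from (patch size 16, plus one class-embedding token):

patch_size = 16
print((224 // patch_size) * (336 // patch_size) + 1)  # 295: 14 x 21 patches + 1 class token for a 224x336 input
print((224 // patch_size) ** 2 + 1)                   # 197: the length the pre-trained position embeddings expect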

Many thanks in advance for your hints and recommendations!


Hello,

I stumbled upon a similar problem, and I really cannot use 4) because of the parts of the image it ignores. I am using 3), but I get the feeling that this is incorrect because the models are trained on images that are not stretched.
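(For reference, option 3 can be reproduced by stretching the image with PIL first and then disabling the processor’s own resize and crop; a minimal sketch, with a placeholder file name:)

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Stretch directly to 224x224, ignoring the aspect ratio, then let the
# processor handle only rescaling and normalization.
image = Image.open("example.jpg").convert("RGB").resize((224, 224))
inputs = processor(images=image, do_resize=False, do_center_crop=False, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])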

Have you found a workaround? Maybe a way to use 1), or a model that was trained on stretched images?

Hi JustasT. Unfortunately I don’t have a solution for the moment. I’ve been doing 4), but it isn’t satisfying…

vivien, thank you for the reply.

I have tested multiple ways of embedding the images and then building a classifier on top. Stretching the image worked best for me, but I still decided to use padding. Instead of padding the image with black margins, I first preprocess the image with the model processor and then add the zero margins. I found that this worked better than padding with black.


“Stretching the image worked best for me”

How did you measure?

“I first preprocess the image with the model processor and then add the zero margins”

What do you mean by “add the zero margins”?
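For anyone reading along, one way to interpret “add the zero margins” is to normalize the image with the processor first and then pad the resulting tensor with zeros out to 224x224, so the margins sit at the normalized mean rather than at black-pixel values. A sketch of that reading (the file name is a placeholder, and this is only a guess at what JustasT does):

import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg").convert("RGB")  # placeholder: any non-square image

# Resize so the longest edge is 224 while keeping the aspect ratio.
width, height = image.size
scale = 224 / max(width, height)
image = image.resize((round(width * scale), round(height * scale)))

# Normalize with the processor only (no resize, no crop).
pixel_values = processor(
    images=image, do_resize=False, do_center_crop=False, return_tensors="pt"
)["pixel_values"]  # shape: [1, 3, new_height, new_width]

# Pad the normalized tensor with zeros to 224x224. A zero here corresponds to
# the normalized mean color, not a black pixel (black normalizes to negative values).
_, _, new_height, new_width = pixel_values.shape
pixel_values = F.pad(pixel_values, (0, 224 - new_width, 0, 224 - new_height))
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])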