Proper way to handle non-square images with CLIP?

Hi everyone. I’m wondering about the proper way to handle non-square images with CLIP.

Here are the ways I can think of:

  1. resize the image so that the shortest edge (height or width) is 224 pixels long (or 336 for openai/clip-vit-large-patch14-336) and provide this non-square image to CLIP’s vision encoder ⟵ This assumes that CLIP’s vision model can interpolate the pre-trained position encodings.

  2. add black margins to the image so that it becomes a square, resize it to 224x224 and provide this square image to CLIP’s vision encoder (see the sketch right after this list) ⟵ this leads to a lower resolution than 1) :frowning:

  3. directly resize the image to 224x224 and provide this square image to CLIP’s vision encoder ⟵ this changes the aspect ratio of the image :frowning:

  4. resize the image so that the shortest edge (height or width) is 224 pixels long, center-crop the image and provide this square image to CLIP’s vision encoder ⟵ this ignores some parts of the image :frowning:
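For reference, here is one way option 2 could look using PIL's ImageOps.pad before handing the square image to the processor. This is only a minimal sketch: the file name is a placeholder, and the black margins are exactly the padding described above.

from PIL import Image, ImageOps
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg").convert("RGB")  # placeholder: any non-square image

# Pad with black margins into a square of side max(width, height),
# keeping the aspect ratio of the original content.
side = max(image.size)
square = ImageOps.pad(image, (side, side), color=(0, 0, 0))

# The processor's default resize + center crop then simply scales the square down to 224x224.
inputs = processor(images=square, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

image_features = model.get_image_features(**inputs)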

I understand that 4) is what happens by default (the CLIP processor has both do_resize and do_center_crop set to True by default).

I tried 1), but CLIPModel doesn’t have an interpolate_pos_encoding flag like ViTModel does, even though CLIP’s vision encoder is a ViT. More precisely, if I do this:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg").convert("RGB")  # placeholder: any non-square image, here wider than tall

# Resize the shortest edge to 224 but skip the center crop, so the
# preprocessed image keeps its original aspect ratio.
inputs = processor(
    images=image,
    padding=True,
    do_center_crop=False,
    do_resize=True,
    return_tensors="pt"
)

print(inputs["pixel_values"].shape)
# torch.Size([1, 3, 224, 336])

output = model(**inputs)

… I get the following error message, which suggests that CLIPModel only accepts square images:

RuntimeError: The size of tensor a (295) must match the size of tensor b (197) at non-singleton dimension 1

(this is consistent with CLIPModel not being able to interpolate positional encodings, because 295 = (224/16) × (336/16) + 1 while 197 = (224/16)² + 1)
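For reference, a quick check of where those two sequence lengths come from (patch size 16, plus one class-embedding token):

patch_size = 16
print((224 // patch_size) * (336 // patch_size) + 1)  # 295: 14 x 21 patches + 1 class token for a 224x336 input
print((224 // patch_size) ** 2 + 1)                   # 197: the length the pre-trained position embeddings expect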

Many thanks in advance for your hints and recommendations!


Hello,

I stumbled upon a similar problem, and I really cannot use 4) because of the parts of the image it ignores. I am using 3), but I get the feeling that this is incorrect because the models are trained on images that are not stretched.
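(For reference, option 3 can be reproduced by stretching the image with PIL first and then disabling the processor’s own resize and crop; a minimal sketch, with a placeholder file name:)

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Stretch directly to 224x224, ignoring the aspect ratio, then let the
# processor handle only rescaling and normalization.
image = Image.open("example.jpg").convert("RGB").resize((224, 224))
inputs = processor(images=image, do_resize=False, do_center_crop=False, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])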

Have you found a workaround? Maybe a way to use 1), or a model that was trained on stretched images?

Hi JustasT. Unfortunately I don’t have a solution for the moment. I’ve been doing 4), but it isn’t satisfying…

vivien, thank you for the reply.

I have tested multiple ways of embedding the images and then building a classifier on top. Stretching the image worked best for me, but I still decided to use padding. Instead of padding the image with black margins, I first preprocess the image with the model processor and then add the zero margins. I found that this worked better than padding with black.


“Stretching the image worked best for me”

How did you measure?

“I first preprocess the image with the model processor and then add the zero margins”

What do you mean by “add the zero margins”?
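For anyone reading along, one way to interpret “add the zero margins” is to normalize the image with the processor first and then pad the resulting tensor with zeros out to 224x224, so the margins sit at the normalized mean rather than at black-pixel values. A sketch of that reading (the file name is a placeholder, and this is only a guess at what JustasT does):

import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg").convert("RGB")  # placeholder: any non-square image

# Resize so the longest edge is 224 while keeping the aspect ratio.
width, height = image.size
scale = 224 / max(width, height)
image = image.resize((round(width * scale), round(height * scale)))

# Normalize with the processor only (no resize, no crop).
pixel_values = processor(
    images=image, do_resize=False, do_center_crop=False, return_tensors="pt"
)["pixel_values"]  # shape: [1, 3, new_height, new_width]

# Pad the normalized tensor with zeros to 224x224. A zero here corresponds to
# the normalized mean color, not a black pixel (black normalizes to negative values).
_, _, new_height, new_width = pixel_values.shape
pixel_values = F.pad(pixel_values, (0, 224 - new_width, 0, 224 - new_height))
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])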