Hi everyone. I’m wondering about the proper way to handle non-square images with CLIP.
Here are the ways I can think of:
1°) resize the image so that the shortest edge (height or width) is 224 pixels long (or 336 for openai/clip-vit-large-patch14-336) and provide this non-square image to CLIP’s vision encoder ⟵ this assumes that CLIP’s vision model can interpolate the pre-trained position encodings
2°) add black margins to the image so that it becomes a square, resize it to 224x224 and provide this square image to CLIP’s vision encoder ⟵ this leads to a lower resolution than 1°)
3°) directly resize the image to 224x224 and provide this square image to CLIP’s vision encoder ⟵ this changes the aspect ratio of the image
4°) resize the image so that the shortest edge (height or width) is 224 pixels long, center-crop the image and provide this square image to CLIP’s vision encoder ⟵ this ignores some parts of the image
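For reference, option 2°) is easy to do by hand with plain PIL before calling the processor. A minimal sketch (the helper name `pad_to_square` and the black fill are my own choices, not anything from `transformers`):

```python
from PIL import Image

def pad_to_square(img: Image.Image, target: int = 224) -> Image.Image:
    """Pad the shorter side with black margins so the image becomes square,
    then resize the square to target x target (option 2 above)."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), (0, 0, 0))
    # Center the original image on the black square canvas.
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((target, target), Image.BICUBIC)
```

Note that the padded pixels still consume patch tokens in the ViT, so part of the encoder's capacity is spent on black margins.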
I understand that 4°) is what happens by default (the CLIP processor has `do_center_crop` set to `True` by default).
I tried 1°), but `CLIPModel` doesn’t have an `interpolate_pos_encoding` flag like `ViTModel` does, even though CLIP’s vision encoder is a ViT. More precisely, if I do this:
```python
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

inputs = processor(
    images=image,
    padding=True,
    do_center_crop=False,
    do_resize=True,
    return_tensors="pt",
)
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 336])
output = model(**inputs)
```
… I get the following error message, which suggests that `CLIPModel` only accepts square images:
RuntimeError: The size of tensor a (295) must match the size of tensor b (197) at non-singleton dimension 1
(This is consistent with `CLIPModel` not being able to interpolate positional encodings, because 295 = (224/16) × (336/16) + 1 and 197 = (224/16)² + 1.)
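If you want to try 1°) anyway, you can resize the pre-trained position embeddings by hand before the forward pass, similar to what `ViTModel` does internally when `interpolate_pos_encoding=True`. A minimal sketch (the function name is mine; it assumes the CLIP ViT layout where the class-token embedding comes first, followed by a square grid of patch embeddings):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_h: int, new_w: int,
                          patch: int = 16) -> torch.Tensor:
    """Resize ViT position embeddings from their pre-trained square grid to a
    (new_h // patch) x (new_w // patch) grid via bicubic interpolation.

    pos_embed: shape (1, 1 + old_grid**2, dim), class token first.
    Returns:   shape (1, 1 + (new_h // patch) * (new_w // patch), dim).
    """
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old = int(grid_tok.shape[1] ** 0.5)  # e.g. 14 for a 224/16 grid
    # (1, old*old, dim) -> (1, dim, old, old) for spatial interpolation
    grid = grid_tok.reshape(1, old, old, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h // patch, new_w // patch),
                         mode="bicubic", align_corners=False)
    # back to (1, new_grid, dim) and re-attach the class token
    grid = grid.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([cls_tok, grid], dim=1)
```

You would then have to write the result back into the vision embeddings (in current `transformers` they live under `model.vision_model.embeddings`, with an accompanying `position_ids` buffer that also needs to grow), which is considerably more invasive than the flag `ViTModel` exposes.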
Many thanks in advance for your hints and recommendations!