Hi everyone. I’m wondering about the proper way to handle non-square images with CLIP.
Here are the ways I can think of:
1. resize the image so that the shortest edge (height or width) is 224 pixels long (or 336 for `openai/clip-vit-large-patch14-336`) and provide this non-square image to CLIP’s vision encoder ⟵ this assumes that CLIP’s vision model can interpolate the pre-trained position encodings
2. add black margins to the image so that it becomes a square, resize it to 224x224 and provide this square image to CLIP’s vision encoder ⟵ this leads to a lower resolution than 1°) (see the sketch right after this list)
3. directly resize the image to 224x224 and provide this square image to CLIP’s vision encoder ⟵ this changes the aspect ratio of the image
4. resize the image so that the shortest edge (height or width) is 224 pixels long, center-crop the image and provide this square image to CLIP’s vision encoder ⟵ this ignores some parts of the image
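For 2°), this is the kind of padding I have in mind, using PIL's `ImageOps.pad` (the file path is just a placeholder, and I'm not claiming this is the recommended way):

```python
from PIL import Image, ImageOps
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("my_image.jpg").convert("RGB")  # placeholder: any non-square image

# Resize while keeping the aspect ratio, then pad with black margins to a 224x224 square
square_image = ImageOps.pad(image, (224, 224), color=(0, 0, 0))

# The image is already 224x224, so skip the processor's own resize and center-crop
inputs = processor(
    images=square_image,
    do_resize=False,
    do_center_crop=False,
    return_tensors="pt",
)
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```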
I understand that 4°) is what happens by default (the CLIP processor has `True` as the default value for both `do_resize` and `do_center_crop`).
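In other words, with the default settings the processor always returns a 224x224 tensor, whatever the aspect ratio of the input (quick check, with `image` being any non-square PIL image):

```python
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# defaults: do_resize=True (shortest edge -> 224) and do_center_crop=True (crop to 224x224)
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```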
I tried 1°), but `CLIPModel` doesn’t have an `interpolate_pos_encoding` flag like the one `ViTModel` has, although CLIP’s vision encoder is a ViT. More precisely, if I do this:
```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("my_image.jpg")  # placeholder: any non-square image (here roughly 2:3)

# resize the shortest edge to 224, but keep the original aspect ratio (no center crop)
inputs = processor(
    images=image,
    padding=True,
    do_center_crop=False,
    do_resize=True,
    return_tensors="pt",
)
print(inputs["pixel_values"].shape)
# torch.Size([1, 3, 224, 336])

output = model(**inputs)
```
… I get the following error message, which suggests that `CLIPModel` only accepts square images:

```
RuntimeError: The size of tensor a (295) must match the size of tensor b (197) at non-singleton dimension 1
```

(this is consistent with `CLIPModel` not being able to interpolate positional encodings, because 295 = (224/16) × (336/16) + 1 is the number of patch tokens for a 224x336 input plus the class token, while 197 = (224/16)^2 + 1 is the number of pre-trained position embeddings for 224x224 inputs)
Many thanks in advance for your hints and recommendations!