Input size of CLIPVisionModel reverts to the default when loading pretrained weights

I am training the CLIPVisionModel on images of size 704x704, so I changed the model configuration to accept an input size of 704x704. The configuration changes as expected.

from transformers import CLIPVisionConfig, CLIPVisionModelWithProjection

# Override the default 224x224 input size before building the model
configuration = CLIPVisionConfig()
configuration.image_size = 704
vision_model = CLIPVisionModelWithProjection(configuration)
print(vision_model.config.image_size)

Output:
704

However, when I call from_pretrained on this model, the configuration reverts to the default input size of 224.

configuration = CLIPVisionConfig()
configuration.image_size = 704
vision_model = CLIPVisionModelWithProjection(configuration).from_pretrained("openai/clip-vit-base-patch32")
print(vision_model.config.image_size)

Output:
224

I am guessing that when I call from_pretrained, it also pulls the config file from the pretrained checkpoint. However, I only want to change the input shape while still initializing from the pretrained weights. Given the structure of a transformer, I believe it should be possible to use the same model with a different input size. Can someone help me figure this out?
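For what it's worth, loading just the config from the hub seems to confirm that guess (a quick check on my end):

from transformers import CLIPVisionConfig

# The checkpoint ships its own config, whose image_size is the default 224,
# and it appears to override whatever I set on my local configuration object.
hub_config = CLIPVisionConfig.from_pretrained("openai/clip-vit-base-patch32")
print(hub_config.image_size)  # 224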

cc @amyeroberts

Hi @andygibson, thanks for your question!

To override settings in the configuration when loading from a checkpoint, pass them as kwargs in the from_pretrained call. The following code should work:

from transformers import CLIPVisionModelWithProjection

# kwargs passed to from_pretrained override the checkpoint's config values
vision_model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32",
    image_size=704,
    ignore_mismatched_sizes=True,
)
print(vision_model.config.image_size)

Note: in order to create this model, you'll also need to set ignore_mismatched_sizes=True when changing image_size. With a patch size of 32, a 704x704 input yields 22 * 22 = 484 patches instead of 7 * 7 = 49, so the checkpoint's position embedding no longer matches; with the flag set, the mismatched weights are newly initialized instead of raising an error.
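You can confirm the resize took effect by inspecting the position embedding directly. A minimal sanity check, assuming the standard attribute layout of the CLIP vision model in transformers:

# (704 // 32) ** 2 + 1 = 485 positions: 484 patch tokens plus the class token;
# the hidden size of the ViT-B/32 vision tower is 768.
embeddings = vision_model.vision_model.embeddings
print(embeddings.position_embedding.weight.shape)
# Expected: torch.Size([485, 768]); this tensor is freshly initialized,
# so it will need to be learned during fine-tuning.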


Thanks a lot! That works.
