Input size of CLIPVisionModel reverts to the default when using pretrained weights

I am training the CLIPVisionModel using images of size 704x704, so I changed the configuration of the CLIPVisionModel to accept an input size of 704x704. The model configuration changes as expected.

configuration = CLIPVisionConfig()
configuration.image_size = 704
vision_model = CLIPVisionModelWithProjection(configuration)
print(vision_model.config.image_size)

Output:
704

However, when I call from_pretrained on this model, the configuration reverts to the default input size of 224.

configuration = CLIPVisionConfig()
configuration.image_size = 704
vision_model = CLIPVisionModelWithProjection(configuration).from_pretrained("openai/clip-vit-base-patch32")
print(vision_model.config.image_size)

Output:
224

I am guessing that when I call from_pretrained, it also takes the config file from the pretrained model. However, I only want to change the input shape while still initializing from the pretrained weights. Given the structure of a transformer, I believe it should be possible to use the same model with a different input shape. Can someone help me figure this out?

cc @amyeroberts

Hi @andygibson, thanks for your question!

To override settings in the configuration when loading from a checkpoint, the kwargs should be passed in the from_pretrained call. The following code should work:

vision_model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32", 
    image_size=704, 
    ignore_mismatched_sizes=True
)
print(vision_model.config.image_size)

Note: in order to create this model, you’ll also need to set ignore_mismatched_sizes when changing image_size.
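
As a quick sanity check (just a sketch, with a random dummy tensor standing in for real preprocessed images), a 704x704 input should now pass through the resized model:

import torch

# Dummy 704x704 batch; real inputs would come from the image processor.
pixel_values = torch.randn(1, 3, 704, 704)
outputs = vision_model(pixel_values)
print(outputs.image_embeds.shape)  # (1, projection_dim)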


Thanks a lot! That works.


Hi! I’m trying something similar with laion-clip-2b, running inference on images of size 140x140 instead of 224x224, and I’m having a hard time with it:

this:

from transformers import CLIPModel

self.model_name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model = CLIPModel.from_pretrained(
    self.model_name,
    image_size=140,
    ignore_mismatched_sizes=True,
)

raises:
model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() got an unexpected keyword argument 'image_size'

and I also tried initializing the processor with different configs:

     args = {"image_size" : 140,
                # "crop_size": {"height": 140,"width": 140},
            # "do_center_crop": False,
            # "do_rescale": False,
            "do_resize": False}
            # "size": {"shortest_edge": 140}}
            self.processor = CLIPProcessor.from_pretrained(self.model_name,**args)

inputs = self.processor(text=self.text_labels, images=list(images), return_tensors="pt")

but the only way it runs is with padding=True, and then I get a black image instead of the original one.

Any idea how I should do it? Thanks!
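
One thing that might explain the TypeError above: for the full CLIPModel, image_size does not seem to be a top-level attribute of CLIPConfig but lives in the nested vision_config, so the unrecognized kwarg ends up being passed to the model constructor. A possible workaround (an untested sketch) is to edit the nested config and pass it in explicitly:

from transformers import CLIPConfig, CLIPModel

model_name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"

# Change the image size on the nested vision config rather than passing it as a kwarg.
config = CLIPConfig.from_pretrained(model_name)
config.vision_config.image_size = 140

model = CLIPModel.from_pretrained(
    model_name,
    config=config,
    ignore_mismatched_sizes=True,  # the position embeddings will be re-initialized
)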

@amyeroberts note that this will generate the following warning:

Some weights of CLIPVisionModelWithProjection were not initialized from the model checkpoint at openai/clip-vit-base-patch32 and are newly initialized because the shapes did not match:
- vision_model.embeddings.position_embedding.weight: found shape torch.Size([50, 768]) in the checkpoint and torch.Size([485, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This means that all position embeddings of the model will be randomly initialized, making the model no longer useful for predictions, as the outputs will be distorted.

It might be more beneficial to either 1) interpolate the pre-trained position embeddings to the new size (as done here) or 2) resize the images to the size that the model expects, and perform a forward pass. Alternatively, you can use models that support various image sizes, like the recent DINOv2 model, which interpolates the pre-trained position embeddings by default:

from transformers import Dinov2Model
import torch

model = Dinov2Model.from_pretrained("facebook/dinov2-base")

pixel_values = torch.randn(1, 3, 140, 140)
outputs = model(pixel_values)
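
For anyone going with option 1 on the original 704x704 question, here is a minimal sketch (not a built-in transformers utility; it assumes the openai/clip-vit-base-patch32 checkpoint) that bicubically resizes the pre-trained 7x7 position-embedding grid to the 22x22 grid produced by 704x704 inputs with patch size 32, instead of leaving the new embeddings randomly initialized:

import torch
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection

# Model resized to 704x704; its position embeddings start out randomly initialized.
model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32",
    image_size=704,
    ignore_mismatched_sizes=True,
)
# Original checkpoint, used only to read the pre-trained (50, 768) position embeddings.
ref = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

old_pos = ref.vision_model.embeddings.position_embedding.weight.data  # (50, 768)
cls_pos, patch_pos = old_pos[:1], old_pos[1:]                         # class token + 49 patches

# Reshape the 49 patch embeddings into a 7x7 grid and resize it to 22x22.
patch_pos = patch_pos.reshape(1, 7, 7, -1).permute(0, 3, 1, 2)        # (1, 768, 7, 7)
patch_pos = F.interpolate(patch_pos, size=(22, 22), mode="bicubic", align_corners=False)
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(22 * 22, -1)        # (484, 768)

# Write the interpolated embeddings back into the resized model.
new_pos = torch.cat([cls_pos, patch_pos], dim=0)                      # (485, 768)
model.vision_model.embeddings.position_embedding.weight.data.copy_(new_pos)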