Input size of CLIPVisionModel reverts to the default when using pretrained weights

I am training the CLIPVisionModel using images of size 704x704, so I changed the configuration of the CLIPVisionModel to accept an input size of 704x704. The model configuration changes as expected.

configuration = CLIPVisionConfig()
configuration.image_size = 704
vision_model = CLIPVisionModelWithProjection(configuration)
print(vision_model.config.image_size)

Output:
704

However, when I call from_pretrained on this model, the configuration reverts to the default input size of 224.

configuration = CLIPVisionConfig()
configuration.image_size = 704
vision_model = CLIPVisionModelWithProjection(configuration).from_pretrained("openai/clip-vit-base-patch32")
print(vision_model.config.image_size)

Output:
224

I am guessing that when I call from_pretrained, it also takes the config file from the pretrained model. However, I only want to change the input shape while still initializing from the pretrained weights. Given the structure of a transformer, I believe it should be possible to use the same model with a different input shape. Can someone help me figure this out?

cc @amyeroberts

Hi @andygibson, thanks for your question!

To override settings in the configuration when loading from a checkpoint, the kwargs should be passed in the from_pretrained call. The following code should work:

vision_model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32", 
    image_size=704, 
    ignore_mismatched_sizes=True
)
print(vision_model.config.image_size)

Note: in order to create this model, you’ll also need to set ignore_mismatched_sizes when changing image_size.
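
As a quick sanity check (just a sketch, with a random dummy tensor standing in for real preprocessed images), a 704x704 input should now pass through the resized model:

import torch

# Dummy 704x704 batch; real inputs would come from the image processor.
pixel_values = torch.randn(1, 3, 704, 704)
outputs = vision_model(pixel_values)
print(outputs.image_embeds.shape)  # (1, projection_dim)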


Thanks a lot! That works.


Hi! I’m trying something similar with laion-clip-2b, running inference on images of size 140x140 instead of 224x224, and I’m having a hard time with it:

this:

from transformers import CLIPModel

self.model_name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model = CLIPModel.from_pretrained(
    self.model_name,
    image_size=140,
    ignore_mismatched_sizes=True,
)

raises:
model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() got an unexpected keyword argument 'image_size'

and I also tried initializing the processor with different configs:

     args = {"image_size" : 140,
                # "crop_size": {"height": 140,"width": 140},
            # "do_center_crop": False,
            # "do_rescale": False,
            "do_resize": False}
            # "size": {"shortest_edge": 140}}
            self.processor = CLIPProcessor.from_pretrained(self.model_name,**args)

inputs = self.processor(text=self.text_labels, images=list(images), return_tensors="pt")

but the only way it runs is with padding=True, and then I get a black image instead of the original one.

Any idea how I should do it? Thanks!
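
One thing that might explain the TypeError above: for the full CLIPModel, image_size does not seem to be a top-level attribute of CLIPConfig but lives in the nested vision_config, so the unrecognized kwarg ends up being passed to the model constructor. A possible workaround (an untested sketch) is to edit the nested config and pass it in explicitly:

from transformers import CLIPConfig, CLIPModel

model_name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"

# Change the image size on the nested vision config rather than passing it as a kwarg.
config = CLIPConfig.from_pretrained(model_name)
config.vision_config.image_size = 140

model = CLIPModel.from_pretrained(
    model_name,
    config=config,
    ignore_mismatched_sizes=True,  # the position embeddings will be re-initialized
)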

@amyeroberts note that this will generate the following warning:

Some weights of CLIPVisionModelWithProjection were not initialized from the model checkpoint at openai/clip-vit-base-patch32 and are newly initialized because the shapes did not match:
- vision_model.embeddings.position_embedding.weight: found shape torch.Size([50, 768]) in the checkpoint and torch.Size([485, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This means that all position embeddings of the model will be randomly initialized, making the model no longer useful for predictions, as the outputs will be distorted.

It might be more beneficial to either 1) interpolate the pre-trained position embeddings to the new size (as done here) or 2) resize the images to the size that the model expects, and perform a forward pass. Alternatively, you can use models that support various image sizes, like the recent DINOv2 model, which interpolates the pre-trained position embeddings by default:

from transformers import Dinov2Model
import torch

model = Dinov2Model.from_pretrained("facebook/dinov2-base")

pixel_values = torch.randn(1, 3, 140, 140)
outputs = model(pixel_values)
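
For anyone going with option 1 on the original 704x704 question, here is a minimal sketch (not a built-in transformers utility; it assumes the openai/clip-vit-base-patch32 checkpoint) that bicubically resizes the pre-trained 7x7 position-embedding grid to the 22x22 grid produced by 704x704 inputs with patch size 32, instead of leaving the new embeddings randomly initialized:

import torch
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection

# Model resized to 704x704; its position embeddings start out randomly initialized.
model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32",
    image_size=704,
    ignore_mismatched_sizes=True,
)
# Original checkpoint, used only to read the pre-trained (50, 768) position embeddings.
ref = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

old_pos = ref.vision_model.embeddings.position_embedding.weight.data  # (50, 768)
cls_pos, patch_pos = old_pos[:1], old_pos[1:]                         # class token + 49 patches

# Reshape the 49 patch embeddings into a 7x7 grid and resize it to 22x22.
patch_pos = patch_pos.reshape(1, 7, 7, -1).permute(0, 3, 1, 2)        # (1, 768, 7, 7)
patch_pos = F.interpolate(patch_pos, size=(22, 22), mode="bicubic", align_corners=False)
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(22 * 22, -1)        # (484, 768)

# Write the interpolated embeddings back into the resized model.
new_pos = torch.cat([cls_pos, patch_pos], dim=0)                      # (485, 768)
model.vision_model.embeddings.position_embedding.weight.data.copy_(new_pos)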