I am training the CLIPVisionModel using images of size 704x704, so I changed the model configuration to accept an input size of 704x704. The configuration changes as expected.
I am guessing that when I call from_pretrained, it also takes the config file from the pretrained model. However, I only want to change the input shape while still initializing from the pretrained weights. Given the structure of a transformer, I believe it should be possible to use the same model with a different input shape. Can someone help me figure this out?
To override settings in the configuration when loading from a checkpoint, the kwargs should be passed in the from_pretrained call. The following code should work:
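A minimal sketch of that call, assuming the openai/clip-vit-base-patch32 checkpoint: image_size overrides the corresponding CLIPVisionConfig attribute, and ignore_mismatched_sizes=True is needed so that loading does not fail on the position embeddings whose shape changes:

from transformers import CLIPVisionModel

# Override the config's image_size while still loading the pretrained weights.
# ignore_mismatched_sizes=True skips the checkpoint weights whose shapes no
# longer match (here, the position embeddings).
model = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    image_size=704,
    ignore_mismatched_sizes=True,
)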
@amyeroberts note that this will generate the following warning:
Some weights of CLIPVisionModelWithProjection were not initialized from the model checkpoint at openai/clip-vit-base-patch32 and are newly initialized because the shapes did not match:
- vision_model.embeddings.position_embedding.weight: found shape torch.Size([50, 768]) in the checkpoint and torch.Size([485, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
This means that the position embeddings of the model will be randomly initialized, making the model no longer useful for predictions, as its outputs will be distorted.
It might be more beneficial to either 1) interpolate the pre-trained position embeddings to the new size (as done here) or 2) resize the images to the size that the model expects and perform a forward pass (see the sketch after the example below). Alternatively, you can use models that support various image sizes, like the recent DINOv2 model, which interpolates the pre-trained position embeddings by default:
import torch
from transformers import Dinov2Model

# DINOv2 interpolates its pre-trained position embeddings to match the input
# resolution, so non-default image sizes work out of the box.
model = Dinov2Model.from_pretrained("facebook/dinov2-base")

pixel_values = torch.randn(1, 3, 140, 140)  # batch of one 140x140 "image"
outputs = model(pixel_values)
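If you prefer option 2 above and want to keep CLIP at its pre-trained resolution, a minimal sketch (using CLIPImageProcessor, which by default resizes inputs to the 224x224 size the checkpoint expects) could look like this:

from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The processor resizes, center-crops, and normalizes the 704x704 image down
# to the 224x224 resolution the checkpoint was trained on, so the pre-trained
# position embeddings are used unchanged.
image = Image.new("RGB", (704, 704))  # placeholder for a real 704x704 image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 50, 768])

This keeps all pre-trained weights intact, at the cost of downsampling the 704x704 images before they reach the model.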