Fine-tuning ViT with more patches/higher resolution

Hi there,
A huge thank you in advance for everyone’s help - really love this forum!!
I would like to fine-tune a ViT at higher resolution, starting from a pretrained model that was trained at 384x384. Is this as simple as creating a new ViTFeatureExtractor and passing interpolate_pos_encoding=True along with pixel_values during training? It seems to me that when TRAINING at higher resolution you would want to be able to train new position encodings instead of interpolating…

I have been googling around a lot for this and wonder if anyone has a good recipe for starting with a pre-trained ViT and increasing the number of patches during fine-tuning. It seems that many of the pre-trained models are actually trained this way (224x224, then 384x384, always with 16x16 patches, so more patches and a longer sequence length). When I try this naively using HuggingFace, it just says the image size does not match the model’s image_size : )
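To make the sequence-length point concrete, here is a quick sketch of the arithmetic. The helper name `num_patches` is just something made up for illustration, and the 512 entry is an assumed example of a "higher" target resolution; the extra +1 token is the [CLS] token that ViT prepends.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Patch count and sequence length (patches + [CLS]) at each resolution.
for size in (224, 384, 512):
    patches = num_patches(size)
    print(f"{size}x{size}: {patches} patches, sequence length {patches + 1}")
# 224x224: 196 patches, sequence length 197
# 384x384: 576 patches, sequence length 577
# 512x512: 1024 patches, sequence length 1025
```

So a checkpoint trained at 384x384 stores 577 position embeddings, and anything larger needs more positions than the model has.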

Thank you again.

Hi @mohotmoz ,
I think you just need to:

  1. set your target size (>224, or >384 in your case) in your initial data transforms,
  2. turn on interpolate_pos_encoding in your forward pass, during both fine-tuning and evaluation.
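The two steps above might look roughly like the sketch below. To keep it runnable without downloading anything, it assumes a tiny randomly initialised `ViTForImageClassification` built from a `ViTConfig` with `image_size=384`, and random tensors standing in for images already resized to 512x512; the 512 target, the layer sizes, and the labels are all made-up values. In practice you would load your pretrained checkpoint with `from_pretrained` and do the resizing in your own transforms.

```python
import torch
from transformers import ViTConfig, ViTForImageClassification

# Tiny stand-in for a checkpoint pretrained at 384x384 (assumed sizes,
# random weights) so the sketch runs offline.
config = ViTConfig(
    image_size=384,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
model = ViTForImageClassification(config)
model.train()

# Step 1: inputs at the higher target resolution (here 512x512).
# Random tensors stand in for the output of your data transforms.
pixel_values = torch.randn(2, 3, 512, 512)
labels = torch.tensor([0, 2])

# Step 2: ask the model to interpolate its 24x24 grid of position
# embeddings up to the 32x32 grid that 512x512 inputs need.
outputs = model(
    pixel_values=pixel_values,
    labels=labels,
    interpolate_pos_encoding=True,
)
outputs.loss.backward()

logits_shape = tuple(outputs.logits.shape)
# The stored position embeddings keep their original shape but still
# receive gradients through the interpolation.
pos_grad = model.vit.embeddings.position_embeddings.grad
print(logits_shape, None if pos_grad is None else tuple(pos_grad.shape))
```

Without `interpolate_pos_encoding=True` the same call raises the image-size mismatch error mentioned in the question. On the worry about training versus interpolating: as far as I can tell, the interpolation is differentiable, so the stored position embeddings are still updated during fine-tuning even though they are resized on the fly.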