Fine-tuning an image Transformer on higher resolution

Hi,

Yes, definitely — let me open a PR today to add this!

And yes, ViT is very much like BERT :wink: it outputs a 768-dimensional vector for each “patch” (which can be seen as a “word”, i.e. a token), whereas a model like ResNet outputs a “feature map” of shape (batch_size, num_channels, height, width).
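To make the shape difference concrete, here is a minimal sketch of the output-shape arithmetic. It assumes the base ViT configuration (224×224 input, 16×16 patches, hidden size 768) and ResNet-50's final-stage downsampling factor of 32; the function names are just illustrative:

```python
# Illustrative shape arithmetic for ViT vs. ResNet outputs.
# Assumes the base ViT config: 224x224 images, 16x16 patches, hidden size 768.

def vit_output_shape(batch_size, image_size=224, patch_size=16, hidden_size=768):
    # Each non-overlapping patch becomes one "token"; +1 for the [CLS] token.
    num_patches = (image_size // patch_size) ** 2
    return (batch_size, num_patches + 1, hidden_size)

def resnet_output_shape(batch_size, image_size=224, stride=32, num_channels=2048):
    # ResNet-50's last feature map downsamples the input by a factor of 32.
    spatial = image_size // stride
    return (batch_size, num_channels, spatial, spatial)

print(vit_output_shape(1))     # (1, 197, 768) -- a sequence of patch embeddings
print(resnet_output_shape(1))  # (1, 2048, 7, 7) -- a spatial feature map
```

This is also why fine-tuning at a higher resolution changes the number of patches (and hence the sequence length), so the position embeddings need to be interpolated accordingly.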