Hi,
Yes, definitely! Let me open a PR today to add this.
And yes, ViT is very much like BERT: it outputs a vector of size 768 for each “patch” (which can be seen as a “word”), whereas a model like ResNet outputs a “feature map” of shape (batch_size, num_channels, height, width).
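
To make the shape difference concrete, here is a rough sketch in plain Python that only does the shape arithmetic (no model is loaded). The helper names are made up for illustration, and the defaults are assumptions on my side: a 224x224 input, a ViT-base setup with 16x16 patches and hidden size 768 plus one extra [CLS] token in front of the patch tokens, and a ResNet-50-style backbone with 2048 output channels and an overall downsampling factor of 32.

```python
def vit_output_shape(batch_size, image_size=224, patch_size=16, hidden_size=768):
    """Hypothetical helper: shape of a ViT-style sequence output.

    Assumes square images, non-overlapping square patches, and one
    extra [CLS] token prepended to the patch tokens.
    """
    num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 with the defaults
    seq_len = num_patches + 1                      # +1 for the assumed [CLS] token
    return (batch_size, seq_len, hidden_size)


def resnet_feature_map_shape(batch_size, image_size=224, num_channels=2048, total_stride=32):
    """Hypothetical helper: shape of a ResNet-style final feature map.

    Assumes the backbone shrinks each spatial side by `total_stride`.
    """
    side = image_size // total_stride              # 224 // 32 = 7 with the defaults
    return (batch_size, num_channels, side, side)


print(vit_output_shape(2))          # (2, 197, 768): one 768-d vector per token
print(resnet_feature_map_shape(2))  # (2, 2048, 7, 7): (batch_size, num_channels, height, width)
```

So the ViT output is a sequence of vectors, just like BERT's hidden states, while the ResNet output keeps a 2D spatial grid.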