ViTImageProcessor output visualization

ViT expects an input of resolution 224x224. The ViTImageProcessor only handles resizing the image to that resolution and normalising it.

The 16x16 patches you mentioned are taken from this processed image, which the ViT model then consumes.
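To make the patch arithmetic concrete, here is a minimal sketch in plain Python. The helper name `patch_grid` is just something I made up for illustration, and it assumes a square image with non-overlapping square patches, as with the 224x224 input and 16x16 patches discussed here.

```python
def patch_grid(image_size=224, patch_size=16):
    """Return (patches per side, total patches) for a square image.

    Hypothetical helper: assumes non-overlapping square patches that
    tile the image exactly.
    """
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, num_patches = patch_grid()
# Each RGB patch holds patch_size * patch_size * 3 raw values.
values_per_patch = 16 * 16 * 3
print(per_side, num_patches, values_per_patch)  # 14 196 768
```

So the processed 224x224 image is cut into a 14x14 grid, i.e. 196 patches, each covering 16x16 pixels across the 3 colour channels.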

Additionally, if you gave the ViTImageProcessor an RGB image, you should get an RGB image back, just resized to 224x224 with 3 channels. I don't know how you ended up with a 6-channel image.
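If it helps with debugging, a quick sanity check on the output shape might look like the sketch below. The function `describe_pixel_values` is a hypothetical helper, not part of any library, and it assumes the shape is laid out as (batch, channels, height, width), which as far as I know is the layout of the processor's `pixel_values` by default.

```python
def describe_pixel_values(shape, expected_channels=3, expected_size=224):
    """Describe a pixel_values shape, flagging anything unexpected.

    Hypothetical helper: assumes a (batch, channels, height, width) layout.
    """
    if len(shape) != 4:
        return (
            "expected 4 dimensions (batch, channels, height, width), "
            f"got {len(shape)}"
        )
    batch, channels, height, width = shape
    if channels != expected_channels:
        return (
            f"unexpected channel count: {channels} "
            f"(expected {expected_channels})"
        )
    if (height, width) != (expected_size, expected_size):
        return (
            f"unexpected size: {height}x{width} "
            f"(expected {expected_size}x{expected_size})"
        )
    return f"ok: {batch} image(s), {channels} channels, {height}x{width}"


print(describe_pixel_values((1, 3, 224, 224)))
print(describe_pixel_values((1, 6, 224, 224)))
```

You could pass it `tuple(inputs["pixel_values"].shape)`, where `inputs` is whatever the processor returned for your image. If the channel count is not 3 there, it is worth checking the mode and shape of the image you fed in before the processor touched it.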