What is ViTImageProcessor doing?

As you can see in the image (on the right), the image is still fairly understandable after applying normalization. However, ViTImageProcessor does something totally different, so I am wondering if anyone can help me understand it. I am trying to fine-tune ViT in TensorFlow with augmentation applied to the images, but training takes about 2 s per batch (regardless of batch size: 32 or 16) because of how slow ViTImageProcessor is.

So another question I'll ask in this thread is: will it work if I apply augmentation (random_brightness, random_contrast, gaussian_noise) to the output of ViTImageProcessor? I am doing this research on my own and don't have exclusive GPU access. The GPU time on Kaggle only allows 2-3 tests each week, and I can't always be in front of a screen to use Google Colab.

P.S. I have tried normalization with mean=0.5 and std=0.5, as used by ViTImageProcessor. The question is now also about image reconstruction with reshape, i.e. (224, 224, 3) to (3, 224, 224) and vice versa.
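For reference, here is roughly what I believe ViTImageProcessor is doing under the hood (my own sketch based on the default config; `vit_preprocess` is my own hypothetical helper, not the actual library code):

```python
import numpy as np
from PIL import Image

def vit_preprocess(image: Image.Image) -> np.ndarray:
    # 1. Resize to 224x224 with PIL's bilinear filter
    image = image.resize((224, 224), resample=Image.BILINEAR)
    arr = np.asarray(image).astype(np.float32)
    # 2. Rescale pixel values from [0, 255] to [0, 1]
    arr = arr / 255.0
    # 3. Normalize with mean=0.5, std=0.5 per channel
    arr = (arr - 0.5) / 0.5
    # 4. Channels first: (224, 224, 3) -> (3, 224, 224) via transpose, not reshape
    return arr.transpose(2, 0, 1)
```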

Ok! I seem to understand what is happening.

  1. The resize methods from TensorFlow and PIL work a bit differently (even when both use bilinear interpolation; how do I know? After setting do_resize=False I got similar results).

  2. I can’t reconstruct the output of ImageProcessor with np.reshape or tf.reshape. The reshaping method used by ImageProcessor works differently, such that reconstruction is not possible. I tried all of the ‘C’, ‘F’, and ‘A’ orders in np.reshape.

So, (1) is all right, but (2) is going to be a problem, isn’t it? It affects the content of the patches and ultimately the results I may achieve.

So, can anyone help me on this?
@sgugger @amyeroberts

Hi @raygx!

If you’re applying augmentation to the images, I’d recommend not using the image processors at all! As you note, they’re pretty slow (something we’re trying to work on), and working directly with tf.image will be a lot easier and faster.
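For example, a rough sketch of doing all of the preprocessing and augmentation with tf.image (the exact augmentation parameters here are just placeholders to adjust for your task):

```python
import tensorflow as tf

def preprocess_and_augment(image, training=True):
    # image: uint8 tensor of shape (H, W, 3)
    image = tf.image.resize(image, (224, 224), method="bilinear")
    image = image / 255.0  # rescale to [0, 1]
    if training:
        image = tf.image.random_brightness(image, max_delta=0.2)
        image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
        # gaussian noise
        image = image + tf.random.normal(tf.shape(image), stddev=0.02)
    image = (image - 0.5) / 0.5  # normalize with mean=0.5, std=0.5
    return tf.transpose(image, perm=(2, 0, 1))  # HWC -> CHW for ViT
```

You can then map this over a tf.data.Dataset so the augmentation runs inside the input pipeline instead of per-batch in Python.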

We have an example of training transformers models with tensorflow for image classification here: transformers/examples/tensorflow/image-classification/run_image_classification.py at main · huggingface/transformers · GitHub

  1. The resize methods from TensorFlow and PIL work a bit differently (even when both use bilinear interpolation; how do I know? After setting do_resize=False I got similar results).

Yes, unfortunately there isn’t a 1:1 correspondence with resizing algorithms across frameworks. As we import models from different frameworks (tf, pt, jax) and the image processors are meant to be agnostic to this, we can’t always resolve the differences.
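You can see the mismatch yourself by resizing the same image with both libraries and comparing the outputs (a quick sketch, not from our codebase):

```python
import numpy as np
import tensorflow as tf
from PIL import Image

rng = np.random.default_rng(0)
arr = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# PIL bilinear resize (what the image processor uses on PIL inputs)
pil_out = np.asarray(
    Image.fromarray(arr).resize((224, 224), resample=Image.BILINEAR),
    dtype=np.float32,
)
# TensorFlow bilinear resize
tf_out = tf.image.resize(arr, (224, 224), method="bilinear").numpy()

# The two results are generally close but not pixel-for-pixel identical
print(np.abs(pil_out - tf_out).max())
```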

  2. I can’t reconstruct the output of ImageProcessor with np.reshape or tf.reshape. The reshaping method used by ImageProcessor works differently, such that reconstruction is not possible. I tried all of the ‘C’, ‘F’, and ‘A’ orders in np.reshape.

Could you provide an example of how the image processor is being called and how the outputs are being reshaped?

@amyeroberts Thanks for your reply! I just figured out what is happening.

But just for the record, I am gonna add this.

I was doing a reshape, but it turns out ImageProcessor does a transpose.
The solution I found for myself is:

normalized = (rescaled_image - 0.5) / 0.5              # mean=0.5, std=0.5
transposed = tf.transpose(normalized, perm=(2, 0, 1))  # HWC -> CHW

which results in

<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([-0.5384153 , -0.56822723, -0.50992393, -0.6224089 , -0.58579427,
       -0.62336934, -0.6540616 ,  0.17835152,  0.34901977,  0.8013607 ],
      dtype=float32)>
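To make the reshape-vs-transpose difference concrete, here is a tiny example (my own illustration): reshape keeps the flat memory order and just reinterprets it, while transpose actually moves elements.

```python
import numpy as np

x = np.arange(12).reshape(2, 2, 3)  # a tiny (H, W, C) "image"

reshaped = x.reshape(3, 2, 2)     # reinterprets the same flat buffer
transposed = x.transpose(2, 0, 1) # actually moves elements to (C, H, W)

print(reshaped[0])    # [[0, 1], [2, 3]]  -- mixes channels together
print(transposed[0])  # [[0, 3], [6, 9]]  -- channel 0 of every pixel
```

That is why no `order` argument to np.reshape could ever reproduce the image processor's output.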

Thanks.