ConvNextImageProcessor weird resize behaviour when input image is 224x224

Hi!

I am working on an image classification task and ran into the issue that the results of my trainer.predict() and pipeline(…) calls showed huge differences. I could track the issue down to the image processor.

I am using ConvNeXTV2, which uses the ConvNextImageProcessor. My original input images are 1200x1920px. As the relevant information is only in the center of the image, I manually cropped to 1100x600px, resized to 224x224 and used that as input for training and validation. The results look good there.

I am using a 224 checkpoint (facebook/convnextv2-tiny-22k-224) and found out that the ConvNextImageProcessor behaves differently for a 224 versus a 384 input size.

For shortest_edge=384 the images are simply resized, which is the behaviour I expected. But for shortest_edge=224 there is more going on: the image is first resized depending on the crop_pct factor, and then a 224x224 square is cropped out and used.
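
For reference, here is a minimal sketch of how I looked at this (the attribute names are just what I see on the processor instance, so treat them as approximate):

```python
from PIL import Image
from transformers import AutoImageProcessor

# load the processor that belongs to the 224 checkpoint mentioned above
processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-224")
print(processor.size, processor.crop_pct)  # e.g. {'shortest_edge': 224} and 0.875

# for shortest_edge < 384 the shorter edge is first resized to
# int(shortest_edge / crop_pct), e.g. int(224 / 0.875) = 256, and only then
# a 224x224 square is center-cropped, so the border of a 224x224 input is lost
image = Image.new("RGB", (224, 224))
outputs = processor(image, return_tensors="pt")
print(outputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```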

In my case I am losing relevant information and the score drops massively. Why is the ConvNextImageProcessor behaving differently depending on shortest_edge?

Also, for training I just resized the images to 224x224. At inference the picture looks completely different when using the ConvNextImageProcessor, as it resizes with a locked aspect ratio and then crops.

What is the right approach to handle this? Should I adapt the preprocessing of my training images to match the ConvNextImageProcessor behaviour? But how am I supposed to know what exactly happens in each of the image processors?

Or should I just use a model that uses shortest_edge=384?

Hope you can help.

Hi,

The ConvNextImageProcessor class replicates the original data transformations during evaluation (source).

  • If the size of the images is 384 or higher, the authors decided to resize the images to a shortest_edge x shortest_edge square and normalize them.
  • If the size is smaller, they first resize the shorter edge to shortest_edge / crop_pct (with the default crop_pct = 224/256 = 0.875 that means 256 for a 224 model), then center crop a shortest_edge x shortest_edge square, then normalize. See the sketch after this list.
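
Roughly, in torchvision terms the evaluation logic looks like the following (this is only a sketch of the two branches, not the actual implementation; the normalization statistics should be taken from the checkpoint's preprocessor_config.json):

```python
from torchvision import transforms

def convnext_eval_transform(shortest_edge, crop_pct=224 / 256,
                            mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    if shortest_edge >= 384:
        # large resolutions: plain resize to a square, no cropping
        return transforms.Compose([
            transforms.Resize((shortest_edge, shortest_edge)),
            transforms.ToTensor(),
            transforms.Normalize(mean, std),
        ])
    # smaller resolutions: resize the shorter edge to shortest_edge / crop_pct,
    # then center-crop a shortest_edge x shortest_edge square
    resize_size = int(shortest_edge / crop_pct)  # 256 for a 224 model
    return transforms.Compose([
        transforms.Resize(resize_size),
        transforms.CenterCrop(shortest_edge),
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])
```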

If you prefer to use the pipeline at inference time, then it’s advised to use the same preprocessing settings as the image processor (which the pipeline uses underneath) during training, so that both align.
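
For example, one way to keep both aligned is to apply the image processor itself inside the training dataset transform (a sketch, assuming a datasets.Dataset with an "image" column):

```python
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-224")

def preprocess(examples):
    # run exactly the same preprocessing that the pipeline applies at inference
    examples["pixel_values"] = processor(
        examples["image"], return_tensors="pt"
    )["pixel_values"]
    return examples

# dataset = dataset.with_transform(preprocess)
```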

Hi!

Thanks for the fast reply!

This means that if I don’t want my input images to be cropped, I can either avoid it by not using a pipeline and doing my own preprocessing, or use a 384 model. Right?

Currently, for training I created my own transforms for the train dataset (some augmentation, cropping, resize) and the validation dataset (only resize).

What would be the right approach here? Should I have a look at the code and replicate what is there in my own transforms for training and validation? Or should I rather use the build_transform method that you were referencing in the code? I guess I can’t use the preprocess method that comes with the transformers ConvNextImageProcessor, as it does not return a transforms object.
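
To make my question more concrete, I was thinking of something along these lines for the validation transform (just a sketch on my side, reading the settings off the processor instance rather than hardcoding them):

```python
from torchvision import transforms
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-224")

shortest_edge = processor.size["shortest_edge"]        # 224 for this checkpoint
resize_size = int(shortest_edge / processor.crop_pct)  # 256 with the default crop_pct

val_transform = transforms.Compose([
    transforms.Resize(resize_size),
    transforms.CenterCrop(shortest_edge),
    transforms.ToTensor(),
    transforms.Normalize(processor.image_mean, processor.image_std),
])
```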

Thank you!