Image classification: Why use both a transform and a processor to preprocess images?

This Hugging Face image-classification tutorial (notebooks/examples/image_classification.ipynb at main · huggingface/notebooks · GitHub) uses torchvision transforms to preprocess the images. In particular, there are val_transforms that resize, crop and normalise the images in the test and validation datasets:
val_transforms = Compose(
    [
        Resize(size),
        CenterCrop(crop_size),
        ToTensor(),
        normalize,
    ]
)
My question is: why do we move this responsibility from the processor
image_processor = AutoImageProcessor.from_pretrained(model_checkpoint)
to the val_transforms? Or are we preprocessing twice, since the image_processor is also handed to the Trainer as the tokenizer? Final question: do I still need to apply both the val_transforms and the processor at inference?
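
For reference, the notebook wires these transforms into the dataset roughly like this (a sketch; the exact helper names may differ from the notebook):

def preprocess_val(example_batch):
    # apply val_transforms to every image in the batch and store the tensors
    example_batch["pixel_values"] = [
        val_transforms(image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

val_ds.set_transform(preprocess_val)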

Hi @Steyn-vanLeeuwen,
I’m quoting from the tutorial:

You might wonder why we pass along the image_processor as a tokenizer when we already preprocessed our data. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the hub.
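
So the image_processor shows up in the Trainer call only for that side effect; roughly (a sketch, assuming the variable names from the notebook):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=image_processor,  # not used for preprocessing here; only uploads its config to the Hub
)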

I don’t think you’re doing two different preprocessing passes. You just take some numbers from the image_processor and pass them to the torchvision transforms:

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
if "height" in image_processor.size:
    # the processor specifies an exact output size
    size = (image_processor.size["height"], image_processor.size["width"])
    crop_size = size
    max_size = None
elif "shortest_edge" in image_processor.size:
    # the processor specifies only the shortest edge; crop to a square of that size
    size = image_processor.size["shortest_edge"]
    crop_size = (size, size)
    max_size = image_processor.size.get("longest_edge")
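
You can inspect those values for a given checkpoint; a minimal sketch, using the ViT checkpoint from the tutorial:

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
print(image_processor.image_mean, image_processor.image_std)  # e.g. [0.5, 0.5, 0.5] [0.5, 0.5, 0.5]
print(image_processor.size)                                   # e.g. {'height': 224, 'width': 224}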

For inference you can apply just the image_processor, as explained in the tutorial.
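
Something along these lines (a sketch; the checkpoint name and image path are placeholders):

from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "my-user/my-finetuned-model"  # placeholder: your fine-tuned repo
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = image_processor(image, return_tensors="pt")  # resizes and normalizes in one call
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])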


Hi @mahmutc,

Thanks for your help!

However, after digging into the code a bit more, I don’t think the tutorial has got it quite right.

The val_transforms use Resize from torchvision, which matches the smallest edge of the image to size and scales the other edge while maintaining the aspect ratio. The image is then cropped by CenterCrop to dimensions (size, size).

The processor, on the other hand, performs Pillow’s resize, which simply scales both sides of the image to the dimensions required by the model.

When applied to an image whose width differs from its height, the difference between these two methods is that the first crops part of the image away, while the second compresses the image more in one direction than in the other.
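
You can see the difference on a dummy wide image (a sketch; 224 is just an example target size):

from PIL import Image
from torchvision.transforms import CenterCrop, Compose, Resize

img = Image.new("RGB", (400, 200))  # dummy image, twice as wide as it is tall

# torchvision path: shortest edge -> 224, then centre crop (left/right strips are cut away)
tv_out = Compose([Resize(224), CenterCrop(224)])(img)
print(tv_out.size)  # (224, 224): content aspect ratio preserved, edges cropped

# Pillow path: scale both sides to the target (content squashed horizontally)
pil_out = img.resize((224, 224))
print(pil_out.size)  # (224, 224): nothing cropped, aspect ratio distorted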

I don’t know how big the performance difference will be when inference uses slightly different preprocessing than training, but in my opinion it’s cleaner to always use the same preprocessing.

What do you think?

preprocessor output: [image]

Pillow resize output: [image]

Hi, fair point. I’m not sure whether the aspect-ratio distortion from RandomResizedCrop’s ratio parameter makes the model more robust.

RandomResizedCrop (from train_transforms) may not maintain the aspect ratio, similar to Pillow’s resize.

https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html#torchvision.transforms.RandomResizedCrop
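
A minimal sketch of that behaviour (the sizes are just examples):

from PIL import Image
from torchvision.transforms import RandomResizedCrop

img = Image.new("RGB", (400, 200))  # dummy wide image

# with the default ratio=(3/4, 4/3), the sampled crop can have a different
# aspect ratio than the square output, so the content gets stretched or squashed
transform = RandomResizedCrop(224)
print(transform(img).size)  # always (224, 224); content aspect ratio not preserved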

Another point: the processor is tied to the pretrained model, so I wouldn’t be surprised if it performs better.


Ahh, I was not aware that RandomResizedCrop also scales the image.

I might change the val_transforms to resemble the processor, but the train_transforms seem fine as they are.

Many thanks for helping me understand! :smiley:
