Why does the TrOCR processor have a feature extractor?

When we are using an image transformer, why do we need a feature extractor (the TrOCR processor is a feature extractor + RoBERTa tokenizer)?
I looked at the output image given by the processor: it is the same as the original image, only resized to a smaller shape.
@nielsr, is the processor doing any kind of image preprocessing?
I tried a few image preprocessing techniques, like binarising the image, adding white space to the borders and a bit of denoising, and they turned out to be of little to no help.
Can you please comment on that too?


Yes, models that take pixel values as input have a feature extractor defined, which applies some basic image preprocessing (typically resizing the image to a particular size + normalizing the color channels).

TrOCR, for instance, expects every image to be of size 384x384.
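
The resize-and-normalize step can be sketched roughly as follows. This is only an illustration, not the library's implementation: the `preprocess` helper is hypothetical, `SIZE` is an assumption taken from the (384, 384) shape reported in the traceback later in this thread, and the mean/std values of 0.5 are assumed as well; the real values come from the checkpoint's preprocessing configuration. Unlike the feature extractor discussed here, the sketch converts the image to RGB itself.

```python
import numpy as np
from PIL import Image

# Assumed values for illustration only; the real ones are stored with the checkpoint.
SIZE = 384
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def preprocess(image: Image.Image) -> np.ndarray:
    """Hypothetical helper: resize, scale to [0, 1], normalize, return (3, H, W)."""
    image = image.convert("RGB").resize((SIZE, SIZE))
    pixels = np.asarray(image, dtype=np.float32) / 255.0  # (H, W, 3), values in [0, 1]
    pixels = (pixels - MEAN) / STD                         # per-channel normalization
    return pixels.transpose(2, 0, 1)                       # channels first


# A blank white image standing in for a scanned text line.
demo = Image.new("RGB", (640, 120), color=(255, 255, 255))
pixel_values = preprocess(demo)
print(pixel_values.shape)  # (3, 384, 384)
```

Nothing here is learned: the output is the same picture, resized and rescaled, which matches the observation in the question above.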

Note that many models show better performance when image augmentations (such as random flipping, cropping, etc.) are introduced during training. These are not included in the feature extractors; for that you can use packages like torchvision or albumentations.

Thanks a lot for the explanation, @nielsr.
Can you please also comment on why FeatureExtractor has a from_pretrained class method?
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
Is it a model? I don’t see class AutoFeatureExtractor subclassing nn.Module.
And if it only has to “apply some basic image preprocessing (typically resize the image to a particular size + normalize the color channels)”,
that could be done with a torchvision.transforms script. So what is AutoFeatureExtractor?
Is it a model that learns to do preprocessing? If so, where can I read about its architecture?

Yes, feature extractors also have a from_pretrained method, which simply loads the same preprocessing configuration as that of a particular checkpoint on the hub.

E.g. if you do ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224"), it will make sure the size attribute of the feature extractor is set to 224. You could of course also just initialize it as feature_extractor = ViTFeatureExtractor(); in this case, the feature extractor’s size attribute will be 224 by default, as seen in the docs.

AutoFeatureExtractor is a class that saves people from having to specify a model-specific feature extractor. The Auto API loads the appropriate feature extractor when you just specify a model name from the hub. It’s a feature extractor, not a model. It takes care of the preprocessing.
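
To make the "configuration, not weights" point concrete, here is a toy sketch. `ToyFeatureExtractor` and its `save`/`load` methods are hypothetical and are not part of transformers; they only mimic the idea that loading a feature extractor means reading a few preprocessing settings from a small config file.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ToyFeatureExtractor:
    """Hypothetical stand-in: holds preprocessing settings, no trainable weights."""

    size: int = 224
    image_mean: tuple = (0.5, 0.5, 0.5)
    image_std: tuple = (0.5, 0.5, 0.5)

    def save(self, path: str) -> None:
        # Write the settings as plain JSON (tuples become JSON lists).
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path: str) -> "ToyFeatureExtractor":
        # Rebuild the object purely from the stored settings.
        with open(path) as f:
            cfg = json.load(f)
        return cls(
            size=cfg["size"],
            image_mean=tuple(cfg["image_mean"]),
            image_std=tuple(cfg["image_std"]),
        )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_preprocessor_config.json")
    ToyFeatureExtractor(size=384).save(path)
    loaded = ToyFeatureExtractor.load(path)

print(loaded.size)  # 384
```

There is no architecture to read about in such a class: the only state is a handful of numbers that tell the preprocessing how to resize and normalize.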


Hi @nielsr,
I followed the step-by-step example in the TrOCR documentation (TrOCR-Doc). However, I ran into a problem when running this line of code:

pixel_values = processor(images=image, return_tensors="pt").pixel_values

The error message is:

Traceback (most recent call last):
  File "./trocr_test_base_printed.py", line 14, in <module>
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
  File "/data/***/anaconda3/envs/hug_face/lib/python3.6/site-packages/transformers/models/trocr/processing_trocr.py", line 117, in __call__
    return self.current_processor(*args, **kwargs)
  File "/data/***/anaconda3/envs/hug_face/lib/python3.6/site-packages/transformers/models/vit/feature_extraction_vit.py", line 141, in __call__
    images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
  File "/data/***/anaconda3/envs/hug_face/lib/python3.6/site-packages/transformers/models/vit/feature_extraction_vit.py", line 141, in <listcomp>
    images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
  File "/data/***/anaconda3/envs/hug_face/lib/python3.6/site-packages/transformers/image_utils.py", line 149, in normalize
    return (image - mean) / std
ValueError: operands could not be broadcast together with shapes (384,384) (3,) 

I guess the problem is the version of transformers and of the feature extractor, but I couldn’t find detailed version information. I’m currently using transformers 4.12.3.

Could you help me with that?
Many thanks

Can you check your image shape and report it? It should be 3-dimensional; check whether it is.
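
A quick way to see why the shape matters is to reproduce the broadcast with plain NumPy. This is a simplified sketch with assumed values: the `try_normalize` helper is hypothetical, the arrays are channels-last, and the mean/std of 0.5 are placeholders, not necessarily what the checkpoint uses.

```python
import numpy as np

# Placeholder per-channel statistics, one value per RGB channel.
MEAN = np.array([0.5, 0.5, 0.5])
STD = np.array([0.5, 0.5, 0.5])


def try_normalize(image: np.ndarray):
    """Hypothetical helper: attempt (image - mean) / std and report a shape mismatch."""
    try:
        return (image - MEAN) / STD
    except ValueError as err:
        print(f"shape {image.shape}: {err}")
        return None


gray = np.zeros((384, 384))    # single-channel image: 2 dimensions
rgb = np.zeros((384, 384, 3))  # color image: 3 dimensions, channels last

failed = try_normalize(gray)   # last axis is 384, cannot broadcast against (3,)
ok = try_normalize(rgb)        # last axis is 3, broadcasts per channel
print(ok.shape)  # (384, 384, 3)
```

Printing `np.array(image).shape` on the loaded image before calling the processor shows which of the two cases applies.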

Thanks for your reply.

I tried a local color image with 3 dimensions, and it works!! THANKS!!

However, when I tried the IAM image, I got the error above. Even when I followed the exact step-by-step guideline, I got the same error. Have you tried the step-by-step code? Or do you have any idea how to handle binary (single-channel) image input? I considered repeating the 1 channel to get 3 channels, but I’m not sure whether that is okay.

The step-by-step code is:

>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image

>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)

>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

Yes, you got it right: np.repeat along the last (channel) dimension should do the job.
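
A minimal sketch of that idea, assuming the image is a 2-D NumPy array of shape (height, width); the `gray_to_rgb` helper is hypothetical. Note that a 2-D array first needs a trailing channel axis, otherwise np.repeat would stretch the width instead of creating channels.

```python
import numpy as np


def gray_to_rgb(gray: np.ndarray) -> np.ndarray:
    """Hypothetical helper: turn an (H, W) array into (H, W, 3) by copying the channel."""
    if gray.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {gray.shape}")
    # Add a trailing channel axis, then repeat it three times.
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)


gray = np.arange(12, dtype=np.uint8).reshape(3, 4)  # tiny stand-in for a scanned line
rgb = gray_to_rgb(gray)
print(rgb.shape)  # (3, 4, 3)
```

When the image is a PIL image, calling `.convert("RGB")` on it, as in the step-by-step snippet above, is another way to get three channels.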

Yeah! ~ It works! Thanks a lot ! :hugs: :hugs: :hugs: