Hi,
The feature extractors (like ViTFeatureExtractor and DeiTFeatureExtractor) can be used to prepare images for Transformer-based models (ViT and DeiT respectively). They mainly do two things: resize images to a given size and normalize the channels. After using the feature extractor, an image is turned into a PyTorch tensor of shape (batch_size, num_channels, height, width), e.g. (1, 3, 224, 224). Next, this tensor is provided to a Transformer that turns it into contextual features. For prediction, one typically just places a linear classification head (nn.Linear) on top of the contextual features.
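For example, here is a minimal sketch of that pipeline with ViT (the checkpoint name and the example image URL are just illustrative choices):

```python
from PIL import Image
import requests
from transformers import ViTFeatureExtractor, ViTModel

# example image (standard COCO image used in the docs)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# the feature extractor resizes + normalizes the image
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

# the Transformer turns the pixel values into contextual features
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, 197, 768): one feature vector per patch (+ [CLS])
```

A classification head would then just be an `nn.Linear(768, num_labels)` applied to those features.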
You might be interested in this project: GitHub - him4318/Transformer-ocr: Handwritten text recognition using transformers. It’s based on DETR, which is available in HuggingFace Transformers. Note that DETR itself consists of a convolutional backbone followed by an encoder-decoder Transformer.
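As a rough illustration of that architecture (using the facebook/detr-resnet-50 checkpoint purely as an example, not the project's own weights), you can load DETR and inspect the decoder outputs like this:

```python
from PIL import Image
import requests
from transformers import DetrFeatureExtractor, DetrModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
inputs = feature_extractor(images=image, return_tensors="pt")

model = DetrModel.from_pretrained("facebook/detr-resnet-50")
outputs = model(**inputs)

# (batch_size, num_queries, hidden_size), e.g. (1, 100, 256):
# one contextual embedding per object query from the Transformer decoder
print(outputs.last_hidden_state.shape)
```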
Instead of using classification heads to predict class labels + bounding boxes (as in the original DETR, which was designed for object detection), he simply adds a linear layer on top of the Transformer outputs, which acts as a “language modeling decoder” (similar to what is done in models like BERT during pre-training). This language modeling decoder maps the contextual features of the Transformer to actual words. It is defined here.
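A hypothetical sketch of such a head (not the project's actual code; the hidden size and vocabulary size below are made-up numbers):

```python
import torch
import torch.nn as nn

hidden_size = 256  # dimensionality of the Transformer decoder outputs (assumption)
vocab_size = 100   # number of characters/tokens to predict (assumption)

# "language modeling decoder": a single linear layer mapping each contextual
# feature to a score over the vocabulary
lm_decoder = nn.Linear(hidden_size, vocab_size)

# suppose these are the contextual features from the Transformer,
# of shape (batch_size, sequence_length, hidden_size)
decoder_outputs = torch.randn(1, 100, hidden_size)

logits = lm_decoder(decoder_outputs)   # (1, 100, vocab_size)
predicted_ids = logits.argmax(dim=-1)  # greedy prediction per position
print(predicted_ids.shape)             # torch.Size([1, 100])
```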