Hi,
The feature extractors (like ViTFeatureExtractor and DeiTFeatureExtractor) can be used to prepare images for Transformer-based models (ViT and DeiT respectively). They mainly do two things: resize images to a given size and normalize the channels. After using the feature extractor, an image is turned into a PyTorch tensor of shape (batch_size, num_channels, height, width), e.g. (1, 3, 224, 224). Next, this tensor is provided to a Transformer that turns it into contextual features. For prediction, one typically just places a linear classification head (nn.Linear) on top of the contextual features.
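For example, here is a minimal sketch of that pipeline with ViT (the checkpoint name and the example image URL are just illustrative choices):

```python
from PIL import Image
import requests
from transformers import ViTFeatureExtractor, ViTModel

# example image (standard COCO image used in the docs)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# the feature extractor resizes + normalizes the image
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

# the Transformer turns the pixel values into contextual features
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, 197, 768): one feature vector per patch (+ [CLS])
```

A classification head would then just be an `nn.Linear(768, num_labels)` applied to those features.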
You might be interested in this project: GitHub - him4318/Transformer-ocr: Handwritten text recognition using transformers. It’s based on DETR, which is available in HuggingFace Transformers. Note that DETR itself consists of a convolutional backbone followed by an encoder-decoder Transformer.
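As a rough illustration of that architecture (using the facebook/detr-resnet-50 checkpoint purely as an example, not the project's own weights), you can load DETR and inspect the decoder outputs like this:

```python
from PIL import Image
import requests
from transformers import DetrFeatureExtractor, DetrModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
inputs = feature_extractor(images=image, return_tensors="pt")

model = DetrModel.from_pretrained("facebook/detr-resnet-50")
outputs = model(**inputs)

# (batch_size, num_queries, hidden_size), e.g. (1, 100, 256):
# one contextual embedding per object query from the Transformer decoder
print(outputs.last_hidden_state.shape)
```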
Instead of using classification heads to predict class labels + bounding boxes (as in the original DETR, which was designed for object detection), he simply adds a linear layer on top of the Transformer outputs, which acts as a “language modeling decoder” (similar to what is done in models like BERT during pre-training). This language modeling decoder maps the contextual features of the Transformer to actual words. It is defined here.
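A hypothetical sketch of such a head (not the project's actual code; the hidden size and vocabulary size below are made-up numbers):

```python
import torch
import torch.nn as nn

hidden_size = 256  # dimensionality of the Transformer decoder outputs (assumption)
vocab_size = 100   # number of characters/tokens to predict (assumption)

# "language modeling decoder": a single linear layer mapping each contextual
# feature to a score over the vocabulary
lm_decoder = nn.Linear(hidden_size, vocab_size)

# suppose these are the contextual features from the Transformer,
# of shape (batch_size, sequence_length, hidden_size)
decoder_outputs = torch.randn(1, 100, hidden_size)

logits = lm_decoder(decoder_outputs)   # (1, 100, vocab_size)
predicted_ids = logits.argmax(dim=-1)  # greedy prediction per position
print(predicted_ids.shape)             # torch.Size([1, 100])
```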