Extract visual and contextual features from images

Yes! The modeling file you're referring to is actually just a Vision Transformer with a modified head, as it explicitly states:

"ViTSTR is basically a ViT that uses DeiT weights.
Modified head to support a sequence of characters prediction for STR."

So if you want to create a similar model using HuggingFace Transformers, you can, as ViT is available (documentation can be found here). You just need to define a similar classification head, as follows:

import torch.nn as nn
from transformers import ViTModel

class ViTSTR(nn.Module):
    def __init__(self, config, num_labels):
        super(ViTSTR, self).__init__()
        self.vit = ViTModel(config)
        self.head = nn.Linear(config.hidden_size, num_labels) if num_labels > 0 else nn.Identity()
        self.num_labels = num_labels

    def forward(self, pixel_values, seqlen=25):
        outputs = self.vit(pixel_values=pixel_values)
        # only keep the first seqlen token positions of the last hidden state
        x = outputs.last_hidden_state[:, :seqlen]

        # batch_size, seqlen, embedding size
        b, s, e = x.size()
        x = x.reshape(b*s, e)
        x = self.head(x).view(b, s, self.num_labels)
        return x

You can then initialize the model as follows:

from transformers import ViTConfig

config = ViTConfig()
model = ViTSTR(config, num_labels=10)
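
For a quick sanity check (using a hypothetical random batch, and assuming the default 224x224 resolution of ViTConfig), the model maps pixel values to a tensor of shape (batch_size, seqlen, num_labels):

import torch

# dummy batch of 2 RGB images at the default ViT resolution (for illustration only)
pixel_values = torch.randn(2, 3, 224, 224)
logits = model(pixel_values, seqlen=25)
print(logits.shape)  # torch.Size([2, 25, 10])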

Note that this doesn’t use any transfer learning. You can of course load any pre-trained ViTModel from the hub by replacing self.vit = ViTModel(config) in the code above with, for example, self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224").
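
As a sketch of that transfer-learning variant (the checkpoint name is just an example, any ViT checkpoint on the hub should work), you could let the wrapper load the backbone itself and read the hidden size from the loaded model's config:

import torch.nn as nn
from transformers import ViTModel

class PretrainedViTSTR(nn.Module):
    def __init__(self, num_labels, checkpoint="google/vit-base-patch16-224"):
        super().__init__()
        # load a pre-trained ViT backbone from the hub
        self.vit = ViTModel.from_pretrained(checkpoint)
        # the head's input size comes from the loaded model's config
        hidden_size = self.vit.config.hidden_size
        self.head = nn.Linear(hidden_size, num_labels) if num_labels > 0 else nn.Identity()
        self.num_labels = num_labels

    def forward(self, pixel_values, seqlen=25):
        outputs = self.vit(pixel_values=pixel_values)
        # only keep the first seqlen token positions of the last hidden state
        x = outputs.last_hidden_state[:, :seqlen]
        b, s, e = x.size()
        x = self.head(x.reshape(b * s, e)).view(b, s, self.num_labels)
        return x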
