I’m looking to create a Vision Transformer (ViT) using Inception V3 as the backbone. For a 500x500 input image, Inception V3 outputs feature maps of shape [1, 2048, 14, 14].
How can I feed these feature maps into a ViT? I came across the ViTHybridForImageClassification
class, which seems relevant, but I’m unsure how to use it with an Inception V3 backbone.
Here’s the code I used to extract the intermediate feature maps from Inception V3:
import torch
import timm

model = timm.create_model('inception_v3', pretrained=True, features_only=True)
# With features_only=True, the model returns a list of feature maps,
# one per stage, so I index the last one.
features = model(torch.randn(1, 3, 500, 500))
print(features[-1].shape)  # torch.Size([1, 2048, 14, 14])
Any guidance would be appreciated.
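For context, here is the kind of wiring I have in mind, as a rough sketch rather than a working solution (the class name FeatureMapViT and all hyperparameters are my own placeholders, not from any library): treat each of the 14x14 spatial positions as a token, project the 2048 channels down to the transformer width with a 1x1 conv, prepend a CLS token, add positional embeddings, and classify from the CLS output.

```python
import torch
import torch.nn as nn

class FeatureMapViT(nn.Module):
    """Sketch: a ViT-style encoder over backbone feature maps [B, 2048, 14, 14]."""

    def __init__(self, in_channels=2048, embed_dim=768, depth=4,
                 num_heads=8, num_classes=10, grid=14):
        super().__init__()
        # 1x1 conv projects backbone channels down to the transformer width
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):                    # feats: [B, 2048, 14, 14]
        x = self.proj(feats)                     # [B, embed_dim, 14, 14]
        x = x.flatten(2).transpose(1, 2)         # [B, 196, embed_dim] - one token per position
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from the CLS token

vit = FeatureMapViT()
logits = vit(torch.randn(1, 2048, 14, 14))
print(logits.shape)  # torch.Size([1, 10])
```

I don’t know whether this matches what ViTHybridForImageClassification does internally, which is why I’m asking.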