Using Inception V3 as Backbone for Vision Transformer

I’m looking to create a Vision Transformer (ViT) using Inception V3 as the backbone. For an input image of size 500x500, Inception V3 outputs feature maps with dimensions [1, 2048, 14, 14].

How can I feed these feature maps into a ViT? I came across the ViTHybridForImageClassification class, which seems relevant, but I’m unsure how to implement it with the Inception V3 backbone.

Here’s the code I used to extract the intermediate feature maps from Inception V3:

import timm
import torch

model = timm.create_model('inception_v3', pretrained=True, features_only=True)
features = model(torch.randn(1, 3, 500, 500))  # features_only=True returns a list of feature maps
print(features[-1].shape)   # last stage: torch.Size([1, 2048, 14, 14])
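
In case it helps frame the question, here is a minimal sketch of the step I'm asking about: flattening the spatial grid of a [1, 2048, 14, 14] feature map into 196 tokens, projecting them to a hidden size, and running them through a transformer encoder. The hidden size (768), head count, layer count, and the use of a [CLS] token plus learned position embeddings are my own illustrative assumptions, not anything prescribed by ViTHybridForImageClassification.

```python
import torch
import torch.nn as nn

# Stand-in for the Inception V3 output described above: [B, 2048, 14, 14].
feature_map = torch.randn(1, 2048, 14, 14)

B, C, H, W = feature_map.shape
tokens = feature_map.flatten(2).transpose(1, 2)   # [B, C, H*W] -> [B, H*W, C] = [1, 196, 2048]

proj = nn.Linear(C, 768)                          # project channels to an assumed hidden size of 768
tokens = proj(tokens)                             # [1, 196, 768]

cls_token = nn.Parameter(torch.zeros(1, 1, 768))  # learnable classification token (illustrative)
pos_embed = nn.Parameter(torch.zeros(1, H * W + 1, 768))  # learned position embeddings (illustrative)
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1) + pos_embed

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(tokens)
print(out.shape)                                  # [1, 197, 768]
```

I'm unsure whether this manual approach is equivalent to what ViTHybridForImageClassification does internally, or whether that class expects a specific backbone interface.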

Any guidance would be appreciated.