I’m looking to create a Vision Transformer (ViT) using Inception V3 as the backbone. For a 500x500 input image, Inception V3 outputs feature maps of shape [1, 2048, 14, 14].
How can I feed these feature maps into a ViT? I came across the ViTHybridForImageClassification
class, which seems relevant, but I’m unsure how to use it with an Inception V3 backbone.
Here’s the code I used to extract the intermediate feature maps from Inception V3:
import torch
import timm

model = timm.create_model('inception_v3', pretrained=True, features_only=True)
# With features_only=True, the model returns a list of feature maps,
# one per stage, so I index the last one.
features = model(torch.randn(1, 3, 500, 500))
print(features[-1].shape)  # torch.Size([1, 2048, 14, 14])
Any guidance would be appreciated.
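For context, here is the kind of wiring I have in mind, as a rough sketch rather than a working solution (the class name FeatureMapViT and all hyperparameters are my own placeholders, not from any library): treat each of the 14x14 spatial positions as a token, project the 2048 channels down to the transformer width with a 1x1 conv, prepend a CLS token, add positional embeddings, and classify from the CLS output.

```python
import torch
import torch.nn as nn

class FeatureMapViT(nn.Module):
    """Sketch: a ViT-style encoder over backbone feature maps [B, 2048, 14, 14]."""

    def __init__(self, in_channels=2048, embed_dim=768, depth=4,
                 num_heads=8, num_classes=10, grid=14):
        super().__init__()
        # 1x1 conv projects backbone channels down to the transformer width
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):                    # feats: [B, 2048, 14, 14]
        x = self.proj(feats)                     # [B, embed_dim, 14, 14]
        x = x.flatten(2).transpose(1, 2)         # [B, 196, embed_dim] - one token per position
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from the CLS token

vit = FeatureMapViT()
logits = vit(torch.randn(1, 2048, 14, 14))
print(logits.shape)  # torch.Size([1, 10])
```

I don’t know whether this matches what ViTHybridForImageClassification does internally, which is why I’m asking.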