Hello,
I am trying to build a ViT model (without a pre-trained checkpoint) on top of a trained ResNet backbone. How can I set up the ViT model so that it takes the ResNet's hidden states as input?
For example, timm.models.vision_transformer_hybrid has HybridEmbed, which lets you use a backbone with ViT. Is there something similar here, or does one need to go into the code and change ViT's patch embedding directly?
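
To make the question concrete, here is roughly the kind of wrapper I have in mind, written in plain PyTorch. It is only a minimal sketch under a few assumptions: the `HybridEmbed` class below is my own simplified version and not timm's implementation, `HybridViT` is a hypothetical name, and the tiny two-layer CNN is just a stand-in for a trained ResNet truncated before its pooling and classification layers.

```python
import torch
import torch.nn as nn


class HybridEmbed(nn.Module):
    """Turn a CNN backbone's final feature map into a sequence of ViT tokens.

    Simplified illustration only; not the timm implementation.
    """

    def __init__(self, backbone, feature_dim, embed_dim):
        super().__init__()
        self.backbone = backbone
        # A 1x1 convolution maps backbone channels to the transformer width.
        self.proj = nn.Conv2d(feature_dim, embed_dim, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)                  # (B, C, H, W)
        tokens = self.proj(feats)                 # (B, D, H, W)
        return tokens.flatten(2).transpose(1, 2)  # (B, H*W, D)


class HybridViT(nn.Module):
    """Hypothetical ViT-style classifier that consumes backbone features."""

    def __init__(self, backbone, feature_dim, num_tokens,
                 embed_dim=64, depth=2, num_heads=4, num_classes=10):
        super().__init__()
        self.embed = HybridEmbed(backbone, feature_dim, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One learned position per feature-map location, plus the class token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True,  # pre-norm blocks, as in ViT
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.embed(x)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        # Classify from the class token only.
        return self.head(self.norm(tokens[:, 0]))


# Stand-in for a trained ResNet trunk (assumption): 32x32 input -> 8x8 map.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
# Keep the already-trained backbone frozen; only the ViT part is trained.
for p in backbone.parameters():
    p.requires_grad = False

model = HybridViT(backbone, feature_dim=32, num_tokens=8 * 8)
logits = model(torch.randn(2, 3, 32, 32))
```

With a real ResNet, `feature_dim` would be the channel count of whichever stage is tapped, and `num_tokens` the height times width of that stage's feature map for the chosen input size.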