Vision Transformer with ResNet backbone

Hello,

I am trying to build a ViT model (not using a pre-trained checkpoint) on top of a trained ResNet backbone. How can one set up the ViT model so that it takes the hidden states of the ResNet as input?

For example, timm.models.vision_transformer_hybrid has HybridEmbed, which allows one to use a backbone with ViT. Is there something similar here, or does one need to go directly into the code and change the patch embedding of ViT?
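
To make the question concrete, here is a rough sketch in plain PyTorch of the kind of hybrid embedding I mean. It is not timm's HybridEmbed and not this library's API. The class name HybridEmbedSketch, the toy two-layer backbone (a stand-in for a real trained ResNet), and all sizes are my own assumptions for illustration. The class token and position embeddings are left out to keep it short.

```python
import torch
import torch.nn as nn


class HybridEmbedSketch(nn.Module):
    """Hypothetical sketch: turn a CNN feature map into a sequence of ViT tokens."""

    def __init__(self, backbone: nn.Module, feature_dim: int, embed_dim: int, patch_size: int = 1):
        super().__init__()
        self.backbone = backbone
        # Project backbone channels to the transformer width. With patch_size=1,
        # every spatial location of the feature map becomes one token.
        self.proj = nn.Conv2d(feature_dim, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                   # (B, C, H', W')
        tokens = self.proj(feats)                  # (B, D, H'/p, W'/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D)


# Stand-in for a trained ResNet (assumption): two strided convs, 64 output channels.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=4, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=4, padding=1),
    nn.ReLU(),
)
# Freeze the backbone so only the projection and the transformer are trained.
for p in toy_backbone.parameters():
    p.requires_grad = False

embed = HybridEmbedSketch(toy_backbone, feature_dim=64, embed_dim=192)

# Randomly initialised transformer encoder standing in for the ViT blocks.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=192, nhead=3, dim_feedforward=384, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

images = torch.randn(2, 3, 64, 64)
tokens = embed(images)   # 64x64 input -> 4x4 feature map -> 16 tokens per image
out = encoder(tokens)
print(tokens.shape, out.shape)
```

With a real ResNet one would presumably drop the pooling and classification head and set feature_dim to the channel count of the last stage, but I have not found where to plug such an embedding into the ViT model here.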