How do I add new, separate layers to a pretrained model to add a modality?

I want to use PyTorch and the Hugging Face library to load a pre-trained model and freeze its weights. Then I want to add new Transformer blocks, where added block b takes in the hidden states from pre-trained Transformer layer b together with the output of the previous added block, b-1.
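To make the idea concrete, here is a rough sketch of what I have in mind. It is not a working multimodal model: the frozen "backbone" is a small randomly initialised stand-in so the snippet runs on its own, and `SideBlock`, `SideNetwork`, `frozen_hidden_states`, the concatenate-then-project fusion, and the single pooled modality vector per example are all my own assumptions. With a real Hugging Face model I assume the per-layer hidden states would instead come from calling the model with `output_hidden_states=True`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, D_MODAL, N_LAYERS = 32, 16, 3  # toy sizes, chosen arbitrarily

# Stand-in for the pre-trained model: a stack of Transformer layers that we freeze.
backbone = nn.ModuleList([
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, dim_feedforward=64, batch_first=True)
    for _ in range(N_LAYERS)
])
backbone.requires_grad_(False)  # freeze the weights
backbone.eval()                 # disable dropout in the frozen part


def frozen_hidden_states(x):
    """Run the frozen stack and collect the hidden state after every layer."""
    states = []
    with torch.no_grad():
        for layer in backbone:
            x = layer(x)
            states.append(x)
    return states


class SideBlock(nn.Module):
    """Hypothetical added block b: fuses frozen state h_b with side state s_{b-1}."""

    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumption: concat + project
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=2 * d_model, batch_first=True
        )

    def forward(self, h_b, s_prev):
        return self.block(self.fuse(torch.cat([h_b, s_prev], dim=-1)))


class SideNetwork(nn.Module):
    """Trainable side layers; the new modality enters as the initial side state."""

    def __init__(self, d_model, d_modal, n_layers):
        super().__init__()
        self.modal_proj = nn.Linear(d_modal, d_model)
        self.blocks = nn.ModuleList([SideBlock(d_model) for _ in range(n_layers)])

    def forward(self, hidden_states, modal_feats):
        seq_len = hidden_states[0].shape[1]
        # Assumption: one pooled modality vector per example, repeated over the sequence.
        s = self.modal_proj(modal_feats).unsqueeze(1).expand(-1, seq_len, -1)
        for h_b, block in zip(hidden_states, self.blocks):
            s = block(h_b, s)
        return s


tokens = torch.randn(2, 5, D_MODEL)  # (batch, seq, d_model) fake token embeddings
modal = torch.randn(2, D_MODAL)      # (batch, d_modal) fake modality features
side = SideNetwork(D_MODEL, D_MODAL, N_LAYERS)
out = side(frozen_hidden_states(tokens), modal)
print(out.shape)  # torch.Size([2, 5, 32])
```

Only the parameters in `side` would be handed to the optimizer; the frozen stack only supplies hidden states, so no gradients flow back into it.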

Conceptually it’s sort of like LoRA, but instead of just adapting the pre-trained LLM, I would like to feed an input from a new modality into these “side layers”.

Is this a dumb idea? Will this not work? What am I missing?

I couldn’t figure out how to add a modality to OpenFlamingo, so I’m trying to work out a naive example of my own.