I want to use PyTorch and the Hugging Face library to load a pre-trained model and freeze its weights. Then I want to add new Transformer blocks, where added block b takes in the hidden states from pre-trained Transformer layer b and the output of the previous added block, b-1.
Conceptually it’s sort of like LoRA, but instead of just adapting the pre-trained LLM I would like to feed an input from another modality into these “side layers”.
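To make the idea concrete, here is a minimal sketch of one way it could be wired up. Several things in it are assumptions, not anything from OpenFlamingo or an established recipe: the `SideNetwork` class name, the tiny randomly initialised GPT-2 standing in for a real checkpoint (swap in `AutoModel.from_pretrained(...)` for actual use), the choice of `nn.TransformerEncoderLayer` as the side block, and the way the two streams are combined (side tokens are concatenated with the frozen hidden states along the sequence axis, and only the side positions are kept).

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model


class SideNetwork(nn.Module):
    """Hypothetical wrapper: frozen backbone plus one trainable side block per layer."""

    def __init__(self, backbone, modal_dim, n_heads=2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.backbone.eval()  # disable dropout in the frozen part
        d = backbone.config.hidden_size
        n_layers = backbone.config.num_hidden_layers
        # Projects the extra modality's features to the backbone width.
        self.modal_proj = nn.Linear(modal_dim, d)
        self.side_blocks = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d, n_heads, dim_feedforward=4 * d, batch_first=True
                )
                for _ in range(n_layers)
            ]
        )

    def forward(self, input_ids, modal_feats):
        # The backbone is frozen, so no graph is needed for it.
        with torch.no_grad():
            out = self.backbone(input_ids=input_ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; entry b comes after layer b.
        hidden_states = out.hidden_states
        side = self.modal_proj(modal_feats)  # (batch, n_modal, d)
        n_modal = side.size(1)
        for b, block in enumerate(self.side_blocks, start=1):
            # Side block b sees the output of side block b-1 (or the projected
            # modal input for b=1) together with the hidden states of layer b.
            x = torch.cat([side, hidden_states[b]], dim=1)
            side = block(x)[:, :n_modal]  # keep only the side-token positions
        return side


# Tiny random backbone so the sketch runs without downloading a checkpoint.
torch.manual_seed(0)
config = GPT2Config(n_layer=2, n_embd=32, n_head=2, vocab_size=100, n_positions=64)
model = SideNetwork(GPT2Model(config), modal_dim=16)

input_ids = torch.randint(0, 100, (2, 10))  # (batch, seq_len)
modal_feats = torch.randn(2, 5, 16)         # (batch, n_modal, modal_dim)
out = model(input_ids, modal_feats)
print(out.shape)
```

Only `modal_proj` and `side_blocks` receive gradients here, so an optimizer can be built from `[p for p in model.parameters() if p.requires_grad]`. Note that calling `model.train()` would put the backbone back into training mode, so it may need another `model.backbone.eval()` afterwards.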
Is this a dumb idea? Will this not work? What am I missing?
I’m too stupid to figure out how to add a modality to OpenFlamingo, so I was naively trying to work out my own example.