@nielsr - I am trying to understand the X-attn code in the VisionEncoderDecoder architecture. I can see the comment in the code: "Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning."
What is the best place to see the actual implementation of the X-attn between the vision encoder and the language decoder?
You can see here that the BertSelfAttention class can be used both for self-attention and cross-attention, depending on whether or not encoder_hidden_states are provided.
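In simplified form, the pattern looks roughly like this (just an illustrative sketch of the idea, not the actual BertSelfAttention code):

```python
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    """Sketch of the self-/cross-attention switch used in BERT-style attention."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, encoder_hidden_states=None):
        # Queries always come from the decoder's own hidden states.
        q = self.query(hidden_states)

        # If encoder_hidden_states is passed, keys/values come from the
        # encoder output -> cross-attention; otherwise from hidden_states
        # itself -> self-attention.
        kv = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        k = self.key(kv)
        v = self.value(kv)

        def split_heads(x):
            b, s, _ = x.shape
            return x.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        context = scores.softmax(dim=-1) @ v
        b, _, s, _ = context.shape
        return context.transpose(1, 2).reshape(b, s, self.num_heads * self.head_dim)
```

Inside VisionEncoderDecoder, the encoder_hidden_states passed to the decoder are the outputs of the vision encoder, so the decoder's queries attend over the image patch features.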
Thanks @nielsr. Just to close this out: my assumption is that loading and fine-tuning models with VisionEncoderDecoder only fine-tunes the X-Attn layers - is that right? If not, is there an explicit way to freeze the pretrained vision and language models?
No, all parameters get updated. It's just that the weights of the cross-attention layers are the only ones that start from scratch; all other weights are already pre-trained, and those get fine-tuned as well.
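If you do want to keep the pretrained vision and language weights frozen and only train the newly added cross-attention layers, you can do that manually by setting requires_grad. A rough sketch (the checkpoint names are just examples, and the "crossattention" substring depends on the decoder architecture, so check the module names of model.decoder for your own setup):

```python
from transformers import VisionEncoderDecoderModel

# Example checkpoints, shown only for illustration.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

# Freeze everything, then unfreeze only parameters whose name suggests
# they belong to a cross-attention block.
for name, param in model.named_parameters():
    param.requires_grad = "crossattention" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```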