VisionEncoderDecoder X-Attn Question

@nielsr - I am trying to understand the cross-attention code in the VisionEncoderDecoder architecture. I can see the comment in the code: "Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning."
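
For reference, this is the kind of setup that produces that message (a minimal sketch; the ViT and BERT checkpoint names are just illustrative):

```python
from transformers import VisionEncoderDecoderModel

# Build an encoder-decoder from two pretrained checkpoints; the decoder
# gets cross-attention layers added with randomly initialized weights,
# which is what the quoted comment refers to.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "bert-base-uncased"
)
```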

What is the best place to see the actual implementation of the cross-attention between the vision encoder and the language decoder?

Please advise.

Thanks
Prithivi

You can see here that the BertSelfAttention class can be used both for self-attention and cross-attention, depending on whether or not encoder_hidden_states are provided.
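
In very simplified form (this is just a sketch of the idea, not the actual transformers source), the switch looks like this:

```python
import torch
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    """Sketch of the idea in BertSelfAttention: self-attention when
    encoder_hidden_states is None, cross-attention (keys/values taken
    from the encoder outputs) when it is provided."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.scale = hidden_size ** -0.5

    def forward(self, hidden_states, encoder_hidden_states=None):
        # Queries always come from the decoder hidden states.
        q = self.query(hidden_states)
        # Keys/values come from the encoder outputs in cross-attention,
        # otherwise from the decoder hidden states (self-attention).
        kv = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        k = self.key(kv)
        v = self.value(kv)
        weights = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return weights @ v
```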


Thanks @nielsr. Just to close this out: my assumption is that loading and fine-tuning a model with VisionEncoderDecoder only fine-tunes the cross-attention layers. Is that right? If not, is there an explicit way to freeze the pretrained vision and language models?

Please advise.

No, all parameters get updated. It’s just that the weights of the cross-attention layers are the only ones that start from scratch; all other weights are already pre-trained, and those get fine-tuned as well.
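
If you do want to train only the cross-attention layers, you can freeze everything else yourself. Here is a sketch assuming a BERT-style decoder, whose cross-attention parameter names contain "crossattention"; the naming may differ for other decoder architectures:

```python
from transformers import VisionEncoderDecoderModel

# Illustrative checkpoints; use the encoder/decoder you actually want.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "bert-base-uncased"
)

# Freeze everything, then leave only the cross-attention weights trainable.
for name, param in model.named_parameters():
    param.requires_grad = "crossattention" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```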


Ok, thank you!