@nielsr - I am trying to understand the cross-attention (X-attn) code in the VisionEncoderDecoder architecture. I can see the comment in the code: "Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning."
What is the best place to see the actual implementation of the cross-attention between the vision encoder and the language decoder? To show what I mean, I've included a small inspection sketch below.
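This is roughly how I've been poking around so far (a minimal sketch, assuming a ViT encoder paired with a GPT-2 decoder purely as an example; the checkpoint names are just for illustration):

```python
from transformers import VisionEncoderDecoderModel

# Example pairing for illustration only: ViT encoder + GPT-2 decoder
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

# The decoder config should now have cross-attention enabled
print(model.decoder.config.add_cross_attention)  # expect True

# List decoder submodules whose names mention cross-attention
for name, module in model.decoder.named_modules():
    if "crossattention" in name.lower():
        print(name, type(module).__name__)
```

This lists the cross-attention modules that get added to each decoder block, but I'd still like a pointer to where in the source the encoder hidden states are actually fed into them.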
Please advise.
Thanks,
Prithivi