@nielsr - I am trying to understand the X-attn code in the VisionEncoderDecoder architecture. I can see the comment in the code: "Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning."
What is the best place to see the actual implementation of the X-attn between the vision encoder and the language decoder?
You can see here that the BertSelfAttention class can be used both for self-attention and cross-attention, depending on whether or not encoder_hidden_states are provided.
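In simplified form, the pattern looks roughly like this (just an illustrative sketch of the idea, not the actual BertSelfAttention code):

```python
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    """Sketch of the self-/cross-attention switch used in BERT-style attention."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, encoder_hidden_states=None):
        # Queries always come from the decoder's own hidden states.
        q = self.query(hidden_states)

        # If encoder_hidden_states is passed, keys/values come from the
        # encoder output -> cross-attention; otherwise from hidden_states
        # itself -> self-attention.
        kv = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        k = self.key(kv)
        v = self.value(kv)

        def split_heads(x):
            b, s, _ = x.shape
            return x.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        context = scores.softmax(dim=-1) @ v
        b, _, s, _ = context.shape
        return context.transpose(1, 2).reshape(b, s, self.num_heads * self.head_dim)
```

Inside VisionEncoderDecoder, the encoder_hidden_states passed to the decoder are the outputs of the vision encoder, so the decoder's queries attend over the image patch features.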
Thanks @nielsr. Just to close this out: my assumption is that loading and fine-tuning models with VisionEncoderDecoder only fine-tunes the X-Attn layers - is that right? If not, is there an explicit way to freeze the pretrained vision and language models?
No, all parameters get updated. It's just that the weights of the cross-attention layers are the only ones that start from scratch; all other weights are already pre-trained, and those get fine-tuned as well.
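If you do want to keep the pretrained vision and language weights frozen and only train the newly added cross-attention layers, you can do that manually by setting requires_grad. A rough sketch (the checkpoint names are just examples, and the "crossattention" substring depends on the decoder architecture, so check the module names of model.decoder for your own setup):

```python
from transformers import VisionEncoderDecoderModel

# Example checkpoints, shown only for illustration.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

# Freeze everything, then unfreeze only parameters whose name suggests
# they belong to a cross-attention block.
for name, param in model.named_parameters():
    param.requires_grad = "crossattention" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```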