How can one visualize the Cross-Attention of a VisionEncoderDecoderModel?

I think the title says it all.
I’m trying to highlight the attention results between my image and the text generated by the model.

However, since I don’t fully grasp the concept of attention and the model is complicated by nature, I don’t understand which outputs I need to take to visualize the cross-attention.

The idea would be to draw a heatmap over the source image showing, for each word of the output sentence, which image features the model attends to.
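For what it's worth, here is a minimal sketch of the heatmap idea, not a definitive recipe. It assumes you already have the cross-attention weights of one decoder layer as an array of shape (num_heads, target_len, source_len), for example taken from the `cross_attentions` field that Transformers encoder-decoder outputs expose when `output_attentions=True` is passed. It also assumes a ViT-style encoder whose first source position is a [CLS] token and whose remaining positions form a square patch grid (e.g. 14 x 14 for a 224-pixel image with 16-pixel patches). The function name `token_heatmap` and its parameters are hypothetical.

```python
import numpy as np

def token_heatmap(cross_attn, token_idx, grid_hw, has_cls=True):
    """Turn cross-attention weights into a 2D map for one generated token.

    cross_attn: array of shape (num_heads, target_len, source_len)
                for a single decoder layer (assumed layout).
    token_idx:  index of the generated token to inspect.
    grid_hw:    (rows, cols) of the encoder patch grid (assumed known).
    has_cls:    drop the first source position if it is a [CLS] token.
    """
    attn = np.asarray(cross_attn, dtype=np.float64)
    # Average over heads, then pick the row for the chosen output token.
    per_token = attn.mean(axis=0)[token_idx]
    if has_cls:
        per_token = per_token[1:]
    rows, cols = grid_hw
    if per_token.size != rows * cols:
        raise ValueError("source length does not match the patch grid")
    grid = per_token.reshape(rows, cols)
    # Min-max normalise so the map is visible even when raw weights are tiny.
    span = grid.max() - grid.min()
    if span == 0:
        return np.zeros_like(grid)
    return (grid - grid.min()) / span
```

Averaging over heads is only one choice; looking at individual heads or at a different layer can give quite different maps.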

Thank you for your help!


Did you figure out how to do it with a Hugging Face model?

I tried overlaying the attention maps on the image as a mask after rescaling them, but it yielded no results, and I haven’t tried any further for now.
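In case it helps anyone retrying this, below is a rough sketch of the rescale-and-overlay step using only NumPy. It assumes the attention map has already been normalised to the range [0, 1] (raw attention weights are often very small, which may be one reason an overlay looks empty) and that the image height and width are exact multiples of the map's height and width. The function name `overlay_heatmap` and the choice of a red tint are illustrative assumptions, not anything prescribed by the library.

```python
import numpy as np

def overlay_heatmap(image, heatmap, alpha=0.5):
    """Blend a low-resolution heatmap onto an RGB image as a red tint.

    image:   uint8 array of shape (H, W, 3).
    heatmap: float array of shape (h, w) with values in [0, 1] (assumed).
    alpha:   maximum opacity of the tint where the heatmap equals 1.
    """
    H, W = image.shape[:2]
    h, w = heatmap.shape
    if H % h or W % w:
        raise ValueError("image size must be a multiple of the heatmap size")
    # Nearest-neighbour upsampling: repeat each cell over its image patch.
    upsampled = np.kron(heatmap, np.ones((H // h, W // w)))
    red = np.zeros_like(image, dtype=np.float64)
    red[..., 0] = 255.0
    # Per-pixel blend weight, broadcast over the colour channels.
    weight = (alpha * upsampled)[..., None]
    blended = (1.0 - weight) * image + weight * red
    return blended.astype(np.uint8)
```

The result can be shown with any image viewer or plotting library; smoother interpolation than nearest-neighbour usually looks nicer but is not needed to check whether the map is meaningful.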