How can one visualize the Cross-Attention of a VisionEncoderDecoderModel?

I think the title says it all.
I’m trying to highlight the attention results between my image and the text generated by the model.

However, since I don’t fully grasp the concept of attention and the model is complicated by nature, I don’t understand which outputs I need to take in order to visualize the cross-attention.

The idea would be to draw a heatmap over the source image for each word of the output sentence, showing which image regions (encoder features) that word attends to.
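
To make the goal concrete, here is a minimal sketch of the post-processing I have in mind, on toy numbers rather than a real model. It assumes the cross-attention weights for one generated token are already available as an array of shape (num_heads, num_encoder_tokens), which as far as I understand is what one slice of `cross_attentions` looks like when the model is called with `output_attentions=True`. It also assumes a ViT-style encoder whose first token is a CLS token followed by a row-major grid of patches; that layout is an assumption and may not hold for every encoder. The function names `cross_attention_heatmap` and `upsample_nearest` are hypothetical helpers, not part of any library.

```python
import numpy as np


def cross_attention_heatmap(attn, grid_h, grid_w, has_cls_token=True):
    """Turn cross-attention weights for ONE generated token into a 2D map.

    attn: array-like of shape (num_heads, num_encoder_tokens).
    grid_h, grid_w: number of patches along the image height and width.
    has_cls_token: assumption that encoder token 0 is a CLS token.
    """
    attn = np.asarray(attn, dtype=np.float64)
    per_token = attn.mean(axis=0)  # average over attention heads
    if has_cls_token:
        per_token = per_token[1:]  # keep only the image patch tokens
    if per_token.size != grid_h * grid_w:
        raise ValueError(
            f"expected {grid_h * grid_w} patch tokens, got {per_token.size}"
        )
    grid = per_token.reshape(grid_h, grid_w)  # row-major patch layout assumed
    span = grid.max() - grid.min()
    if span == 0:
        return np.zeros_like(grid)  # uniform attention: nothing to highlight
    return (grid - grid.min()) / span  # rescale to [0, 1] for plotting


def upsample_nearest(grid, patch_size):
    """Repeat each patch value over a patch_size x patch_size pixel block."""
    return np.kron(grid, np.ones((patch_size, patch_size)))


# Toy example: 2 heads, 1 CLS token + a 2x2 grid of patches.
toy_attn = [
    [0.0, 0.1, 0.2, 0.3, 0.4],
    [0.0, 0.3, 0.2, 0.5, 0.4],
]
heatmap = cross_attention_heatmap(toy_attn, grid_h=2, grid_w=2)
pixel_map = upsample_nearest(heatmap, patch_size=16)  # shape (32, 32)
```

The resulting `pixel_map` could then be overlaid on the resized input image, for example with matplotlib's `imshow` and an `alpha` value, once per generated word. Averaging over heads is only one choice; looking at individual heads or at a specific decoder layer may give different pictures.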

