How can one visualize the Cross-Attention of a VisionEncoderDecoderModel?

I think the title says it all.
I'm trying to visualize the attention between my input image and the text generated by the model.

However, since I don't fully grasp the concept of attention, and the model is complicated by nature, I don't understand what I need to extract from the model to visualize the cross-attention.

The idea would be to draw a heatmap over the source image showing which image regions each word of the output sentence attends to.
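To make it concrete: from the `transformers` docs I understand that calling `model.generate(pixel_values, output_attentions=True, return_dict_in_generate=True)` should return `outputs.cross_attentions`, one tuple per generated token, each containing one tensor per decoder layer of shape `(batch, num_heads, tgt_len, src_len)`. Below is a rough sketch (with dummy data standing in for the real model outputs, since I'm not sure of my setup) of the post-processing I imagine for a single generated token, assuming a ViT-B/16 encoder at 224x224, i.e. 197 source positions (a CLS token plus 14x14 = 196 patches):

```python
import numpy as np

# Dummy cross-attention for ONE generated token at ONE decoder layer.
# Real shape from transformers would be (batch, num_heads, tgt_len, src_len);
# here: 1 batch, 12 heads, 1 target token, 197 source positions
# (1 CLS token + 14*14 = 196 patches for a ViT-B/16 encoder at 224x224).
num_heads, num_patches_side, patch_size = 12, 14, 16
src_len = 1 + num_patches_side ** 2
attn = np.random.rand(1, num_heads, 1, src_len)
attn /= attn.sum(axis=-1, keepdims=True)  # each row is a softmax distribution

# 1) Average over heads, select the single target token, drop the CLS position.
patch_attn = attn.mean(axis=1)[0, 0, 1:]          # shape (196,)

# 2) Reshape to the 14x14 patch grid.
grid = patch_attn.reshape(num_patches_side, num_patches_side)

# 3) Nearest-neighbour upsample to the 224x224 image resolution
#    (each patch covers a 16x16 pixel block).
heatmap = np.kron(grid, np.ones((patch_size, patch_size)))  # shape (224, 224)

# 4) Normalise to [0, 1] so it can be overlaid on the source image.
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())

print(heatmap.shape)
```

The normalised heatmap could then be blended over the image, e.g. with matplotlib's `imshow` and an `alpha` value, one heatmap per generated word. Is this roughly the right way to interpret the cross-attention tensors, or am I missing a step?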

Thank you for your help!