How can one visualize the Cross-Attention of a VisionEncoderDecoderModel?

I tried overlaying the attention maps on the image as a mask after rescaling them, but it yielded no useful results, and I haven't tried anything further for now.
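For reference, here is a minimal sketch of the overlay step I have in mind, written with NumPy only so it does not depend on any particular model. It assumes the cross-attention for one generated token has already been averaged over heads into a 1D vector, that the encoder is ViT-style (a CLS token followed by a row-major grid of patches), and that the image is a float array in [0, 1]. The function name `attention_overlay` and its parameters are hypothetical, not part of any library. One possible reason a naive overlay shows nothing is that raw softmax weights over a few hundred patches are all tiny, so the sketch min-max normalises them before blending.

```python
import numpy as np


def attention_overlay(attn, image, grid_hw, has_cls=True, alpha=0.6):
    """Turn one token's cross-attention row into an image-sized mask and overlay.

    attn:    1D array of length src_len (assumed already averaged over heads).
    image:   float array of shape (H, W, 3) with values in [0, 1].
    grid_hw: (rows, cols) of the encoder's patch grid, e.g. (14, 14).
    has_cls: assumption that the first encoder position is a CLS token.
    alpha:   how strongly low-attention regions are darkened.
    """
    attn = np.asarray(attn, dtype=np.float64)
    if has_cls:
        attn = attn[1:]  # the CLS position has no spatial location
    gh, gw = grid_hw
    if attn.size != gh * gw:
        raise ValueError(f"expected {gh * gw} patch weights, got {attn.size}")
    grid = attn.reshape(gh, gw)

    # Min-max normalise so the strongest patch maps to 1 and the weakest to 0.
    span = grid.max() - grid.min()
    grid = (grid - grid.min()) / span if span > 0 else np.zeros_like(grid)

    # Nearest-neighbour upsampling from the patch grid to the image size.
    h, w = image.shape[:2]
    rows = np.arange(h) * gh // h
    cols = np.arange(w) * gw // w
    mask = grid[rows[:, None], cols[None, :]]

    # Darken the image where attention is low; keep it intact where it is high.
    overlay = image * ((1.0 - alpha) + alpha * mask[..., None])
    return mask, overlay
```

As far as I understand, calling the model with `output_attentions=True` returns `cross_attentions` as one tensor per decoder layer with shape `(batch, num_heads, target_len, source_len)`, so something like `cross_attentions[-1][0].mean(0)[t]` would give the vector for token `t` to pass in as `attn`. The resulting `overlay` can then be shown with any image viewer, for example `matplotlib.pyplot.imshow`.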