Vision Transformer: reconstructing a feature image

From the output of a ViT, how can I reconstruct a feature image like the feature maps a CNN outputs?

I need this kind of representation because another network relies on the information it carries.

I need to obtain an output like the feature images shown in the video for that image, so that I can use it as input for another part of the network.


See my notebook here: Transformers-Tutorials/Visualize_self_attention_of_DINO.ipynb in the NielsRogge/Transformers-Tutorials repository on GitHub.
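
If what you need is a CNN-style feature map rather than an attention visualization, one common approach is to drop the [CLS] token and reshape the remaining patch tokens back onto the patch grid. Below is a minimal sketch of that idea, not taken from the notebook. It assumes a standard ViT whose output has shape (batch, 1 + H*W, hidden_dim) with patches in row-major order, as is the case for `last_hidden_state` of the Hugging Face `ViTModel`. The helper name `tokens_to_feature_map` and the random tensor standing in for the model output are made up for illustration.

```python
import torch


def tokens_to_feature_map(tokens: torch.Tensor, grid_h: int, grid_w: int,
                          num_prefix_tokens: int = 1) -> torch.Tensor:
    """Reshape ViT tokens (B, prefix + H*W, D) into a CNN-style map (B, D, H, W).

    Hypothetical helper: assumes the patch tokens are stored in row-major order
    and that the first `num_prefix_tokens` tokens ([CLS] etc.) are not patches.
    """
    batch, _, dim = tokens.shape
    patches = tokens[:, num_prefix_tokens:, :]  # drop [CLS] and other prefix tokens
    if patches.shape[1] != grid_h * grid_w:
        raise ValueError(
            f"expected {grid_h * grid_w} patch tokens, got {patches.shape[1]}"
        )
    # (B, H*W, D) -> (B, H, W, D) -> (B, D, H, W)
    return patches.reshape(batch, grid_h, grid_w, dim).permute(0, 3, 1, 2).contiguous()


# Random stand-in for the token output of a ViT on a 224x224 image with
# 16x16 patches: 1 [CLS] token + 14*14 patch tokens, hidden size 768.
dummy_tokens = torch.randn(2, 1 + 14 * 14, 768)
feature_map = tokens_to_feature_map(dummy_tokens, grid_h=14, grid_w=14)
print(feature_map.shape)  # torch.Size([2, 768, 14, 14])
```

The result has the same layout as a convolutional feature map, so it can be passed to convolutional layers in the next part of your network. The grid size is the image size divided by the patch size; if your model adds extra prefix tokens (for example a distillation token), adjust `num_prefix_tokens` accordingly.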