Vision Transformer: reconstructing an image-like feature map from the output

Dear all,
how can I reconstruct an image-like feature map from the output of a ViT, similar to the output of a CNN?

I need this kind of representation because another network takes it as its input.
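
To make the target concrete, here is a rough sketch of the kind of reshaping I have in mind. It is not taken from any particular library. It assumes the ViT returns tokens of shape (batch, 1 + H*W, dim), with a class token at index 0 and the patch tokens in row-major order. The function name `tokens_to_feature_map` and the 14x14 grid with 768 channels are only illustrative assumptions.

```python
import numpy as np

def tokens_to_feature_map(tokens, grid_h, grid_w, has_cls_token=True):
    """Reshape ViT tokens (B, N, D) into a CNN-style map (B, D, H, W).

    Assumption: patch tokens are stored row by row (row-major order),
    and the class token, if present, sits at index 0.
    """
    if has_cls_token:
        tokens = tokens[:, 1:, :]  # drop the class token
    batch, num_patches, dim = tokens.shape
    if num_patches != grid_h * grid_w:
        raise ValueError(
            f"expected {grid_h * grid_w} patch tokens, got {num_patches}"
        )
    # (B, H*W, D) -> (B, H, W, D): unflatten the patch axis into a grid
    fmap = tokens.reshape(batch, grid_h, grid_w, dim)
    # (B, H, W, D) -> (B, D, H, W): channels-first, like a CNN feature map
    return fmap.transpose(0, 3, 1, 2)

# Illustrative sizes: a 224x224 input with 16x16 patches gives a 14x14 grid.
tokens = np.random.rand(2, 1 + 14 * 14, 768).astype(np.float32)
feature_map = tokens_to_feature_map(tokens, grid_h=14, grid_w=14)
print(feature_map.shape)  # (2, 768, 14, 14)
```

With PyTorch tensors the same idea would presumably use `reshape` followed by `permute(0, 3, 1, 2)`. Is this the right way to do it, or is there a better approach?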