I have been trying to visualize attention maps of vision transformers. I was able to do this for ViT using the attention rollout method.
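For reference, the rollout I used for ViT looks roughly like this (a minimal sketch rather than my exact code; the checkpoint and image path are just placeholders):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Placeholder checkpoint and image, not necessarily what I actually used.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# Attention rollout (Abnar & Zuidema, 2020): average over heads, add the
# identity for the residual connection, re-normalize, then multiply layer by layer.
rollout = None
for attn in outputs.attentions:          # each: (batch, heads, tokens, tokens)
    a = attn.mean(dim=1)                 # average over heads
    a = a + torch.eye(a.size(-1))        # account for the residual connection
    a = a / a.sum(dim=-1, keepdim=True)  # re-normalize rows
    rollout = a if rollout is None else a @ rollout

# rollout[:, 0, 1:] gives the CLS token's attention over the patch tokens,
# which can be reshaped to the patch grid and upsampled onto the image.
```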
However, when I tried to do the same for SwinV2, I observed that the attention tensors in the SwinV2-Large model had the following shapes:
```
torch.Size([16, 144, 144])
torch.Size([4, 144, 144])
torch.Size([1, 144, 144])
torch.Size([1, 36, 36])
```
Since the attention matrices have different sizes, they cannot simply be recursively multiplied to compute the rollout. I could do it for the first three attention states, which share the 144x144 shape, but I am not sure that is the right approach.
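To be concrete, a simplified version of what I tried on the first three attention states looks like the sketch below. Note that collapsing the leading window dimension by averaging is my own simplification here, and it is one of the things I suspect is wrong, since it ignores the window layout and the patch merging between stages:

```python
import torch

def partial_rollout(attentions):
    # attentions: the per-stage tensors with the shapes listed above,
    # e.g. attentions[0].shape == (16, 144, 144); the name is just for illustration.
    rollout = None
    for attn in attentions[:3]:              # only the stages with 144x144 matrices
        a = attn.mean(dim=0)                 # collapse the window dimension (naive)
        a = a + torch.eye(a.size(-1))        # residual connection, as in ViT rollout
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout                           # (144, 144), but its spatial meaning is unclear
```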
TLDR: I want to know how to visualize attention maps from SwinV2 transformers, given that the attention matrices do not all have the same shape. Is there a paper or a code repository I could refer to?
Thank you for your help.