I am having trouble interpreting how hidden_states and last_hidden_state are indexed in transformer models for computer vision. Which layer's output is last_hidden_state?
For example, with a Swin Transformer tiny model, hidden_states is a tuple of 5 tensors with sizes 3136x96, 784x192, 196x384, 49x768 and 49x768 respectively. I inspected them, but I could not find last_hidden_state among the entries of the hidden_states tuple.
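A minimal sketch of how these outputs can be inspected, assuming the Hugging Face transformers library and its SwinModel class (the post does not name the library, so this is an assumption). To avoid depending on a particular checkpoint, it uses a randomly initialized model built from the default SwinConfig, which is assumed here to match the Swin tiny layout (224x224 input, patch size 4, embedding dimension 96). It prints the shape of every entry in hidden_states and checks whether any entry is numerically equal to last_hidden_state.

```python
import torch
from transformers import SwinConfig, SwinModel

# Assumption: the default SwinConfig corresponds to the Swin tiny layout.
# The weights are random (no checkpoint is downloaded), so only the shapes
# and the indexing are meaningful here, not the values.
config = SwinConfig()
model = SwinModel(config).eval()

# One dummy RGB image of size 224x224.
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)

# Shape of each entry in the hidden_states tuple (batch, tokens, channels).
hidden_shapes = [tuple(h.shape) for h in outputs.hidden_states]
for index, shape in enumerate(hidden_shapes):
    print(f"hidden_states[{index}]: {shape}")

last_shape = tuple(outputs.last_hidden_state.shape)
print(f"last_hidden_state: {last_shape}")

# For each entry, check whether it is numerically equal to last_hidden_state.
matches = [
    h.shape == outputs.last_hidden_state.shape
    and torch.allclose(h, outputs.last_hidden_state, atol=1e-6)
    for h in outputs.hidden_states
]
print("entries equal to last_hidden_state:", matches)
```

The same comparison can be run on a pretrained checkpoint by replacing the randomly initialized model with one loaded through from_pretrained.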
I ran into a similar problem with ViT models too.
Could anyone help me understand these embeddings from the model output class, especially for computer vision transformers? I am trying to get some interpretability out of the model outputs.
Thanks in advance.