In the documentation example for VivitModel, Video Vision Transformer (ViViT) (huggingface.co), why is the sequence length in last_hidden_states.shape 3137 rather than 3136? Where does the extra 1 come from?
With the default parameters image_size=224, num_frames=32, and tubelet_size=[2,16,16], the number of tubelets should be (224/16) x (224/16) x (32/2) = 14 x 14 x 16 = 3136.
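To double-check that arithmetic, here is a minimal sketch in plain Python. The helper name `count_tubelets` is hypothetical (it is not part of transformers), and the values are hard-coded from the defaults quoted above rather than read from the library's config.

```python
# Default values as quoted in the question (assumed, not read from the library).
image_size = 224
num_frames = 32
tubelet_size = (2, 16, 16)  # (frames, height, width)


def count_tubelets(image_size, num_frames, tubelet_size):
    """Hypothetical helper: number of non-overlapping tubelets in one video."""
    t, h, w = tubelet_size
    return (num_frames // t) * (image_size // h) * (image_size // w)


num_tubelets = count_tubelets(image_size, num_frames, tubelet_size)

# Sequence length reported by the documentation example.
observed_seq_len = 3137
extra_tokens = observed_seq_len - num_tubelets

print(num_tubelets)   # number of tubelets from the formula above
print(extra_tokens)   # tokens in the output not accounted for by tubelets
```

So the tubelet count itself comes out as expected, and the output sequence has exactly one position more than the tubelets account for.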
However, the example in the documentation gives:
# forward pass
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)
[1, 3137, 768]