VivitModel last hidden states dimension Problem

BeefWellington · July 11, 2024, 4:02pm

In the example in the VivitModel documentation, Video Vision Transformer (ViViT) (huggingface.co), why is the second dimension of last_hidden_states.shape 3137 rather than 3136? Where does the extra 1 come from?

With the default parameters image_size=224, num_frames=32, and tubelet_size=[2,16,16], the number of tubelets is (224/16)x(224/16)x(32/2) = 14x14x16 = 3136.

# forward pass
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)
[1, 3137, 768]
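
A likely explanation, sketched under an assumption rather than confirmed in this thread: ViT-style models prepend one learnable classification ([CLS]) token to the patch sequence, and the ViViT embedding layer in Transformers appears to do the same, which would account for 3136 + 1 = 3137. The helper `vivit_sequence_length` and its `add_cls_token` flag below are hypothetical names used only to illustrate the arithmetic; they are not part of the library.

```python
# Hypothetical helper, for illustration only (not a Transformers API).
# Assumption: one [CLS] token is prepended to the tubelet sequence.
def vivit_sequence_length(image_size=224, num_frames=32,
                          tubelet_size=(2, 16, 16), add_cls_token=True):
    t, h, w = tubelet_size
    # Non-overlapping tubelets along time, height, and width.
    num_tubelets = (num_frames // t) * (image_size // h) * (image_size // w)
    # The optional extra position for the classification token.
    return num_tubelets + (1 if add_cls_token else 0)


print(vivit_sequence_length(add_cls_token=False))  # 3136
print(vivit_sequence_length())                     # 3137
```

If that assumption holds, last_hidden_states[:, 0] would be the [CLS] embedding and last_hidden_states[:, 1:] the 3136 tubelet embeddings.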
