Size of last_hidden_state and mask in ViTMAE

nielsr · January 23, 2024, 9:26pm

Hi,

The 50 comes from the fact that the model uses a mask_ratio of 0.75 by default, so only the visible (unmasked) tokens are passed through the encoder. The number of tokens sent through the model is therefore int(math.ceil((1 - mask_ratio) * (num_patches + 1))).

In this case, (224 // 16)**2 = 196 patches are created, and adding 1 for the CLS token gives 197 tokens. Hence we get int(math.ceil(0.25 * 197)) = 50.
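
If it helps, here is the same arithmetic as a small Python sketch. It only reproduces the formula from this post; the function name encoder_sequence_length and its default arguments (224x224 images, 16x16 patches) are my own assumptions for illustration and are not part of the Transformers API.

```python
import math


def encoder_sequence_length(image_size=224, patch_size=16, mask_ratio=0.75):
    """Hypothetical helper: number of tokens the encoder sees, per the formula above."""
    # Number of non-overlapping square patches in the image.
    num_patches = (image_size // patch_size) ** 2
    # +1 accounts for the CLS token; only the (1 - mask_ratio) share is kept.
    return int(math.ceil((1 - mask_ratio) * (num_patches + 1)))


print(encoder_sequence_length())  # 50
```

So with the default settings, the second dimension of last_hidden_state is 50.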

The decoder of ViTMAE then takes in a sequence of encoded visible patches + a shared mask token for all masked patches, and reconstructs the masked ones.
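
To make the decoder input a bit more concrete, here is a toy sketch in plain Python of how a full-length sequence can be rebuilt from the visible tokens plus one shared mask token. It follows the shuffle/restore idea used in MAE, but it is a simplification under my own assumptions: restore_sequence, the "[MASK]" placeholder and the string "tokens" are hypothetical, the CLS token is left out, and this is not the actual Transformers code.

```python
def restore_sequence(visible, ids_restore, mask_token="[MASK]"):
    """Hypothetical helper: rebuild the original patch order for the decoder.

    visible:     tokens kept by the encoder, in shuffled order
    ids_restore: for each original position, its index in the shuffled sequence
    """
    num_masked = len(ids_restore) - len(visible)
    # Append the same shared mask token once per masked patch.
    shuffled = list(visible) + [mask_token] * num_masked
    # Undo the shuffle so every token is back at its original position.
    return [shuffled[i] for i in ids_restore]


patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
ids_shuffle = [3, 0, 5, 1, 4, 2]  # a fixed "random" permutation for the demo
len_keep = 2                      # keep only the first 2 shuffled patches

# Tokens the encoder would see, and the inverse permutation (argsort of ids_shuffle).
visible = [patches[i] for i in ids_shuffle[:len_keep]]
ids_restore = sorted(range(len(ids_shuffle)), key=lambda i: ids_shuffle[i])

restored = restore_sequence(visible, ids_restore)
print(restored)  # ['p0', '[MASK]', '[MASK]', 'p3', '[MASK]', '[MASK]']
```

The visible patches end up at their original positions, and every masked position holds the same shared mask token, which is what the decoder then tries to reconstruct.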
