Size of last_hidden_state and mask in ViTMAE

Hi,

The 50 comes from the fact that the model uses a mask_ratio of 0.75 by default, and only the visible (unmasked) patch tokens are encoded. Hence the number of tokens sent through the encoder is equal to int(math.ceil((1 - mask_ratio) * (num_patches + 1))).

In this case, (224 // 16)**2 = 196 patches are created, and adding 1 for the CLS token gives 197 tokens, hence we get int(math.ceil(0.25 * 197)) = 50.
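To make the arithmetic easy to check for other settings, here is a minimal sketch that just evaluates the formula above. The function name visible_sequence_length and its parameter names are made up for illustration and are not part of the library; the defaults assume a 224x224 image, 16x16 patches and a mask_ratio of 0.75.

```python
import math


def visible_sequence_length(image_size=224, patch_size=16, mask_ratio=0.75):
    """Hypothetical helper: number of tokens the encoder sees, per the formula above."""
    # Number of non-overlapping square patches in a square image.
    num_patches = (image_size // patch_size) ** 2
    # +1 accounts for the CLS token.
    return int(math.ceil((1 - mask_ratio) * (num_patches + 1)))


print(visible_sequence_length())  # 50
```

With these defaults, the second dimension of last_hidden_state should therefore be 50.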

The decoder of ViTMAE then takes in the sequence of encoded visible patches plus a shared mask token at the position of every masked patch, and reconstructs the masked ones.
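As a rough illustration of how the decoder gets back to the full sequence length, here is a toy sketch in plain Python that uses strings instead of tensors. It is not the library's code: build_decoder_input, ids_shuffle, ids_restore and the "<mask>" placeholder are names assumed for this example, and it assumes 49 of the 196 patches are kept.

```python
import random


def build_decoder_input(encoded, ids_restore, mask_token="<mask>"):
    """Hypothetical helper: pad visible tokens with mask tokens and undo the shuffle.

    encoded: [cls, v_0, ..., v_{k-1}], the visible tokens in shuffled order.
    ids_restore: for each original patch position, its index in the shuffled order.
    """
    cls_token, visible = encoded[0], encoded[1:]
    num_patches = len(ids_restore)
    # Append the same (shared) mask token once per masked patch.
    shuffled = visible + [mask_token] * (num_patches - len(visible))
    # Put every token back at its original patch position.
    restored = [shuffled[i] for i in ids_restore]
    return [cls_token] + restored


num_patches, num_kept = 196, 49
rng = random.Random(0)

# Random permutation of patch indices; the first num_kept entries stay visible.
ids_shuffle = list(range(num_patches))
rng.shuffle(ids_shuffle)

# Inverse permutation: position of each original patch in the shuffled order.
ids_restore = [0] * num_patches
for position, patch_index in enumerate(ids_shuffle):
    ids_restore[patch_index] = position

encoded = ["<cls>"] + [f"patch_{i}" for i in ids_shuffle[:num_kept]]
decoder_input = build_decoder_input(encoded, ids_restore)

print(len(encoded))        # 50
print(len(decoder_input))  # 197
```

So the encoder output has 50 tokens, while the decoder works on all 197 positions, 147 of which hold the mask token.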
