Size of last_hidden_state and mask in ViTMAE

Running the example code:

from transformers import AutoImageProcessor, ViTMAEModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")

outputs = model(**inputs)  # single forward pass, so the mask matches the hidden states
print(outputs.last_hidden_state.size())  # torch.Size([1, 50, 768])
print(outputs.mask.size())  # torch.Size([1, 196])

Where does the 50 come from? Isn't 196 the sequence length (the number of patches), so shouldn't the second dimension of last_hidden_state be 196, or 197 if the CLS token is added?

Second question: model.config.mask_ratio is 0.75, but the ratio of zeros in mask is 176/196, which is about 0.9. Why?

Hi,

The 50 comes from the fact that the model uses a mask_ratio of 0.75 by default, and only the visible (non-masked) patch tokens are passed through the encoder. Hence the number of tokens sent through the model is int(math.ceil((1 - mask_ratio) * (num_patches + 1))).

In this case, (224 // 16)**2 = 196 patches are created, plus 1 for the CLS token, giving 197 tokens in total; hence int(math.ceil(0.25 * 197)) = 50.
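As a sanity check, here's the same arithmetic in a few lines (a minimal sketch; image_size and patch_size are the defaults of the facebook/vit-mae-base checkpoint):

import math

image_size, patch_size, mask_ratio = 224, 16, 0.75

num_patches = (image_size // patch_size) ** 2      # 196
num_tokens = num_patches + 1                       # 197, including the CLS token
num_visible = int(math.ceil((1 - mask_ratio) * num_tokens))

print(num_patches)   # 196 -> size of mask
print(num_visible)   # 50  -> second dim of last_hidden_state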

The decoder of ViTMAE then takes the sequence of encoded visible patches together with a shared, learnable mask token for each masked position, and reconstructs the pixel values of the masked patches.
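As for the second question: per the ViTMAE docs, mask uses 1 for masked patches and 0 for visible ones, so it's the fraction of ones (not zeros) that should match mask_ratio. A quick way to check this, sketched below reusing the image from above with ViTMAEForPreTraining so the reconstruction shape is visible as well:

from transformers import AutoImageProcessor, ViTMAEForPreTraining
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# fraction of masked patches (1 = masked, 0 = visible)
print(outputs.mask.sum() / outputs.mask.numel())  # tensor(0.7500)

# reconstructed pixel values for every patch:
# (batch_size, num_patches, patch_size**2 * num_channels) = (1, 196, 16*16*3)
print(outputs.logits.size())  # torch.Size([1, 196, 768])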