Size of last_hidden_state and mask in ViTMAE

Running the example code:

from transformers import AutoImageProcessor, ViTMAEModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")

outputs = model(**inputs)  # single forward pass, so the mask matches the hidden states
print(outputs.last_hidden_state.size())  # torch.Size([1, 50, 768])
print(outputs.mask.size())  # torch.Size([1, 196])

Where does the 50 come from? Isn't 196 the sequence length (the number of patches), so shouldn't the second dimension of last_hidden_state be 196, or 197 if the CLS token is added?

Second question: model.config.mask_ratio is 0.75, but the ratio of zeros in mask is 176/196, which is about 0.9. Why?

Hi,

The 50 comes from the fact that the model uses a mask_ratio of 0.75 by default, and only the visible (non-masked) patch tokens are passed through the encoder. Hence the number of tokens sent through the model is int(math.ceil((1 - mask_ratio) * (num_patches + 1))).

In this case, (224 // 16)**2 = 196 patches are created, plus 1 for the CLS token, giving 197 tokens in total; hence int(math.ceil(0.25 * 197)) = 50.
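As a sanity check, here's the same arithmetic in a few lines (a minimal sketch; image_size and patch_size are the defaults of the facebook/vit-mae-base checkpoint):

import math

image_size, patch_size, mask_ratio = 224, 16, 0.75

num_patches = (image_size // patch_size) ** 2      # 196
num_tokens = num_patches + 1                       # 197, including the CLS token
num_visible = int(math.ceil((1 - mask_ratio) * num_tokens))

print(num_patches)   # 196 -> size of mask
print(num_visible)   # 50  -> second dim of last_hidden_state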

The decoder of ViTMAE then takes the sequence of encoded visible patches together with a shared, learnable mask token for each masked position, and reconstructs the pixel values of the masked patches.
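As for the second question: per the ViTMAE docs, mask uses 1 for masked patches and 0 for visible ones, so it's the fraction of ones (not zeros) that should match mask_ratio. A quick way to check this, sketched below reusing the image from above with ViTMAEForPreTraining so the reconstruction shape is visible as well:

from transformers import AutoImageProcessor, ViTMAEForPreTraining
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# fraction of masked patches (1 = masked, 0 = visible)
print(outputs.mask.sum() / outputs.mask.numel())  # tensor(0.7500)

# reconstructed pixel values for every patch:
# (batch_size, num_patches, patch_size**2 * num_channels) = (1, 196, 16*16*3)
print(outputs.logits.size())  # torch.Size([1, 196, 768])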