Running example code
from transformers import AutoImageProcessor, ViTMAEModel
from PIL import Image
import requests

# Load a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)  # single forward pass instead of running the model twice
print(outputs.last_hidden_state.size())  # torch.Size([1, 50, 768])
print(outputs.mask.size())               # torch.Size([1, 196])
Where does the 50 come from? Isn't 196 the sequence length, i.e., the number of patches, so shouldn't the second dimension of last_hidden_state be 196, or 197 if a CLS token is added?
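For reference, here is the arithmetic behind that expectation (a quick sketch; the 224x224 input resolution and 16x16 patch size are the defaults I understand facebook/vit-mae-base to use):

image_size = 224  # default input resolution, as I understand it
patch_size = 16   # default patch size
num_patches = (image_size // patch_size) ** 2
print(num_patches)      # 196 -- matches the second dimension of mask
print(num_patches + 1)  # 197 -- the sequence length I expected with a CLS token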