I don’t understand why ViTMAEModel produces different results when it is called “directly” versus when I run the model’s embeddings and encoder submodules myself. The mask ratio is set to 0.
from transformers import AutoImageProcessor, ViTMAEModel
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")
inputs = image_processor(images=image, return_tensors="pt")

# Disable masking so that no patches should be dropped.
model.config.mask_ratio = 0

# ViTMAEEmbeddings returns (embeddings, mask, ids_restore).
emb, _, _ = model.embeddings(**inputs)
# Case 1: encoder applied to my embeddings vs. the full model (prints False)
enc_hidden = model.encoder(emb, output_hidden_states=True).hidden_states[12]
full_hidden = model(**inputs, output_hidden_states=True).hidden_states[12]
print(torch.all(enc_hidden == full_hidden).item())
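One thing I noticed while narrowing this down: ViTMAEEmbeddings seems to shuffle the patch sequence with freshly drawn random noise on every forward pass, even when mask_ratio is 0, so the two calls in case 1 may not see the same patch order. Here is a minimal sketch of a control for that, assuming the optional noise argument that the embeddings module and ViTMAEModel.forward appear to accept for deterministic testing, and assuming the attribute model.embeddings.num_patches exists:

# Sketch: pass the same masking noise to both paths so they use the same shuffle.
num_patches = model.embeddings.num_patches  # 196 for 224x224 inputs with 16x16 patches
noise = torch.rand(1, num_patches)
emb_fixed, _, _ = model.embeddings(inputs["pixel_values"], noise=noise)
enc_fixed = model.encoder(emb_fixed, output_hidden_states=True).hidden_states[12]
full_fixed = model(**inputs, noise=noise, output_hidden_states=True).hidden_states[12]
# I would expect True here if the random shuffle is the only difference.
print(torch.allclose(enc_fixed, full_fixed))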
# Case 2: last_hidden_state vs. a LayerNorm of the final hidden state (prints False)
norm = torch.nn.LayerNorm(model.config.hidden_size, eps=model.config.layer_norm_eps)
last_direct = model(**inputs)[0]
last_normed = norm(model(**inputs, output_hidden_states=True).hidden_states[12])
print(torch.all(last_direct == last_normed).item())
Why are the hidden states not equal in case 1, and why, in case 2, is the layer normalization of the last hidden state not equal to the last_hidden_state obtained “directly”?
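For case 2 I can see two possible confounds, but I am not sure they account for everything: the model is called twice, so each call may shuffle the patches differently, and a freshly constructed torch.nn.LayerNorm starts with the default affine parameters (weight = 1, bias = 0), while the model presumably applies its own pretrained final layernorm. Here is a single-call sketch of the comparison I have in mind, assuming the model exposes that module as model.layernorm (I have not confirmed the attribute name):

# Single forward pass, so both tensors come from the same patch shuffle.
out = model(**inputs, output_hidden_states=True)
# hidden_states[12] should be the last encoder layer before the final layernorm.
# I would expect True here if that layernorm is the only remaining step.
print(torch.allclose(out.last_hidden_state, model.layernorm(out.hidden_states[12])))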