Calling ViTMAEModel with embeddings and encoder

I don’t understand why ViTMAEModel produces different results when it is called “directly” versus through the model’s embeddings and encoder attributes. The mask ratio was set to 0.

from transformers import AutoImageProcessor, ViTMAEModel
import torch
from PIL import Image
import requests

url = ""
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")

model.config.mask_ratio = 0

emb, _, _ = model.embeddings(**inputs)

#1 False
print(torch.all(model.encoder(emb, output_hidden_states=True).hidden_states[12] == model(**inputs,output_hidden_states=True).hidden_states[12]).item())

#2 False
norm = torch.nn.LayerNorm(model.config.hidden_size, eps=model.config.layer_norm_eps)
print(torch.all(model(**inputs)[0] == norm(model(**inputs, output_hidden_states=True).hidden_states[12])).item())

Why are the hidden states not equal in case 1, and why, in case 2, does applying layer normalization to the final hidden state not reproduce the last_hidden_state obtained “directly”?

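OK, I solved this. Even when mask_ratio is set to 0, the noise argument needs to be provided for the results to be reproducible, because the embeddings draw fresh random noise on every call when none is passed. Also, the final layer normalization has learnable parameters, so the model’s own model.layernorm must be used instead of a newly created LayerNorm. The following produces True in both cases: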

import numpy as np

# One noise value per patch; the same array is passed to every call below
num_patches = int((model.config.image_size // model.config.patch_size) ** 2)
noise = np.random.uniform(size=(1, num_patches))

emb, _, _ = model.embeddings(**inputs,noise=torch.from_numpy(noise))

#1 True
print(torch.all(model.encoder(emb, output_hidden_states=True).hidden_states[12] == model(**inputs,noise=torch.from_numpy(noise),output_hidden_states=True).hidden_states[12]).item())

#2 True
print(torch.all(model(**inputs,noise=torch.from_numpy(noise))[0] == model.layernorm(model(**inputs, output_hidden_states=True,noise=torch.from_numpy(noise)).hidden_states[12])).item())
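For anyone who wants to see both effects without downloading the checkpoint, here is a minimal sketch in plain PyTorch. It rests on assumptions: random_masking_sketch is a hypothetical stand-in modeled on how I understand the masking step to work (sort per-patch noise, keep the first patches in that order), not the library's actual code, and the weight and bias values given to the "trained-like" LayerNorm are made up for illustration.

```python
import torch


def random_masking_sketch(sequence, mask_ratio, noise=None):
    """Hypothetical stand-in for the masking step; not the library's code."""
    batch_size, seq_length, dim = sequence.shape
    len_keep = int(seq_length * (1 - mask_ratio))
    if noise is None:
        # Without an explicit noise tensor, fresh noise is drawn on every call
        noise = torch.rand(batch_size, seq_length)
    # Patches are ordered by their noise value, then the first len_keep are kept
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    index = ids_keep.unsqueeze(-1).expand(-1, -1, dim)
    return torch.gather(sequence, dim=1, index=index)


torch.manual_seed(0)
patches = torch.arange(6, dtype=torch.float32).reshape(1, 6, 1)  # 6 toy "patches"

# Part 1: with mask_ratio 0 every patch is kept, but their order follows the noise,
# so two calls only agree when they share the same noise tensor.
noise = torch.rand(1, 6)
same_a = random_masking_sketch(patches, 0.0, noise=noise)
same_b = random_masking_sketch(patches, 0.0, noise=noise)
kept_all = same_a.shape == patches.shape
reproducible = torch.equal(same_a, same_b)

# Part 2: a LayerNorm with non-default weight and bias (made-up values here)
# does not match a freshly created LayerNorm, whose weight is 1 and bias is 0.
trained_like = torch.nn.LayerNorm(4)
with torch.no_grad():
    trained_like.weight.fill_(2.0)
    trained_like.bias.fill_(0.5)
fresh = torch.nn.LayerNorm(4)
x = torch.randn(1, 3, 4)
layernorms_differ = not torch.allclose(trained_like(x), fresh(x))

print(kept_all, reproducible, layernorms_differ)
```

If the real embeddings behave like this sketch, then model.embeddings(**inputs) and model(**inputs) each draw their own noise when none is passed, which would explain the mismatch in case 1, while the fresh torch.nn.LayerNorm in case 2 lacks the trained weight and bias.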
