I don’t understand why ViTMAEModel produces different results when it is called “directly” versus when I run the model’s embeddings and encoder submodules myself. The mask ratio is set to 0.
from transformers import AutoImageProcessor, ViTMAEModel
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")
inputs = image_processor(images=image, return_tensors="pt")

# Disable masking so that no patches should be dropped.
model.config.mask_ratio = 0

# ViTMAEEmbeddings returns (embeddings, mask, ids_restore).
emb, _, _ = model.embeddings(**inputs)
# Case 1: encoder applied to my embeddings vs. the full model (prints False)
enc_hidden = model.encoder(emb, output_hidden_states=True).hidden_states[12]
full_hidden = model(**inputs, output_hidden_states=True).hidden_states[12]
print(torch.all(enc_hidden == full_hidden).item())
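One thing I noticed while narrowing this down: ViTMAEEmbeddings seems to shuffle the patch sequence with freshly drawn random noise on every forward pass, even when mask_ratio is 0, so the two calls in case 1 may not see the same patch order. Here is a minimal sketch of a control for that, assuming the optional noise argument that the embeddings module and ViTMAEModel.forward appear to accept for deterministic testing, and assuming the attribute model.embeddings.num_patches exists:

# Sketch: pass the same masking noise to both paths so they use the same shuffle.
num_patches = model.embeddings.num_patches  # 196 for 224x224 inputs with 16x16 patches
noise = torch.rand(1, num_patches)
emb_fixed, _, _ = model.embeddings(inputs["pixel_values"], noise=noise)
enc_fixed = model.encoder(emb_fixed, output_hidden_states=True).hidden_states[12]
full_fixed = model(**inputs, noise=noise, output_hidden_states=True).hidden_states[12]
# I would expect True here if the random shuffle is the only difference.
print(torch.allclose(enc_fixed, full_fixed))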
# Case 2: last_hidden_state vs. a LayerNorm of the final hidden state (prints False)
norm = torch.nn.LayerNorm(model.config.hidden_size, eps=model.config.layer_norm_eps)
last_direct = model(**inputs)[0]
last_normed = norm(model(**inputs, output_hidden_states=True).hidden_states[12])
print(torch.all(last_direct == last_normed).item())
Why are the hidden states not equal in case 1, and why, in case 2, is the layer normalization of the last hidden state not equal to the last_hidden_state obtained “directly”?
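For case 2 I can see two possible confounds, but I am not sure they account for everything: the model is called twice, so each call may shuffle the patches differently, and a freshly constructed torch.nn.LayerNorm starts with the default affine parameters (weight = 1, bias = 0), while the model presumably applies its own pretrained final layernorm. Here is a single-call sketch of the comparison I have in mind, assuming the model exposes that module as model.layernorm (I have not confirmed the attribute name):

# Single forward pass, so both tensors come from the same patch shuffle.
out = model(**inputs, output_hidden_states=True)
# hidden_states[12] should be the last encoder layer before the final layernorm.
# I would expect True here if that layernorm is the only remaining step.
print(torch.allclose(out.last_hidden_state, model.layernorm(out.hidden_states[12])))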