ViTMAEModel with model.eval() returns two different representations?

My code is as follows. How can I get identical encodings for the same image?

import torch
from transformers import ViTMAEModel

pixel_value = torch.randn([1, 3, 224, 224])

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base").eval()  # switch to evaluation mode.

encoding_a = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)
encoding_b = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)

# print(encoding_a == encoding_b)
assert torch.equal(encoding_a, encoding_b)  # fails: the encodings are not equal!

I have worked around the problem by setting ‘mask_ratio = 0.0’, but I don’t know whether this approach is valid for inference:

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0).eval()  # mask_ratio=0.0 keeps all patches.
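
With that change, repeated forward passes on the same image line up. A minimal sketch of the check (I use torch.allclose as the tolerant comparison; since no patches are dropped, the pooled encodings should match):

import torch
from transformers import ViTMAEModel

pixel_value = torch.randn([1, 3, 224, 224])

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0).eval()

with torch.no_grad():
    encoding_a = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)
    encoding_b = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)

assert torch.allclose(encoding_a, encoding_b)  # passes: no patches are dropped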

Hi,

The model internally generates a random boolean mask as seen here. This happens on every forward pass, even after model.eval(), because the noise is sampled directly with torch.rand rather than through a module whose behaviour changes between train and eval mode.
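
For context, the masking boils down to roughly the following (a simplified paraphrase of the random_masking logic in modeling_vit_mae.py, not the exact implementation):

import torch

def random_masking_sketch(sequence, mask_ratio=0.75, noise=None):
    # Sample per-patch noise, argsort it, and keep the patches with the
    # smallest noise values. There is no self.training check here, which
    # is why model.eval() does not turn the masking off.
    batch_size, seq_length, dim = sequence.shape
    len_keep = int(seq_length * (1 - mask_ratio))
    if noise is None:
        noise = torch.rand(batch_size, seq_length)  # fresh randomness on every call
    ids_shuffle = torch.argsort(noise, dim=1)  # ascending: low noise = kept
    ids_keep = ids_shuffle[:, :len_keep]
    return torch.gather(sequence, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, dim))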

To make it reproducible, one can provide a noise argument to the forward method (to make sure the same boolean mask is applied on every call):

import numpy as np
import torch
from transformers import ViTMAEModel

pixel_values = torch.randn([1, 3, 224, 224])

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

num_patches = int((model.config.image_size // model.config.patch_size) ** 2)
noise = np.random.uniform(size=(1, num_patches))

encoding_a = model(pixel_values, noise=torch.from_numpy(noise)).last_hidden_state[:, 1:, :].mean(dim=1)
encoding_b = model(pixel_values, noise=torch.from_numpy(noise)).last_hidden_state[:, 1:, :].mean(dim=1)
assert torch.equal(encoding_a, encoding_b)
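
The same idea works with torch alone, by the way (numpy is not required; this is just a variant of the snippet above):

num_patches = int((model.config.image_size // model.config.patch_size) ** 2)
noise = torch.rand(1, num_patches)  # created once, reused for every call

encoding_a = model(pixel_values, noise=noise).last_hidden_state[:, 1:, :].mean(dim=1)
encoding_b = model(pixel_values, noise=noise).last_hidden_state[:, 1:, :].mean(dim=1)
assert torch.equal(encoding_a, encoding_b)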

Thanks for your reply.

But I found that we just need to reload the model with ‘mask_ratio = 0.0’ (which lets the model see all patches); then we get reproducible encodings from the same image.
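
A quick way to confirm that ‘mask_ratio = 0.0’ really keeps every patch is to compare sequence lengths (the numbers below assume the base checkpoint: 224×224 input and 16×16 patches, i.e. 196 patches plus a CLS token):

import torch
from transformers import ViTMAEModel

pixel_values = torch.randn([1, 3, 224, 224])

default_model = ViTMAEModel.from_pretrained("facebook/vit-mae-base").eval()
full_model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0).eval()

with torch.no_grad():
    print(default_model(pixel_values).last_hidden_state.shape)  # (1, 50, 768): CLS + 49 of 196 patches
    print(full_model(pixel_values).last_hidden_state.shape)     # (1, 197, 768): CLS + all 196 patches

Note that the kept patch tokens may still come out in a shuffled order internally, but the mean over the patch dimension used above does not depend on that order.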