ViTMAEModel with model.eval() returns two different representations?

My code is as follows. How can I get identical encodings for the same image?

import torch
from transformers import ViTMAEModel

pixel_value = torch.randn([1, 3, 224, 224])

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base").eval()  # switch to evaluation mode.

encoding_a = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)
encoding_b = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)

# print(encoding_a == encoding_b)
assert torch.equal(encoding_a, encoding_b)  # fails: the encodings are not equal!

I have worked around the problem by setting ‘mask_ratio = 0.0’, but I don’t know whether this approach is valid for inference:

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0).eval()  # mask_ratio=0.0 keeps all patches.
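
With that change, repeated forward passes on the same image line up. A minimal sketch of the check (I use torch.allclose as the tolerant comparison; since no patches are dropped, the pooled encodings should match):

import torch
from transformers import ViTMAEModel

pixel_value = torch.randn([1, 3, 224, 224])

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0).eval()

with torch.no_grad():
    encoding_a = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)
    encoding_b = model(pixel_value).last_hidden_state[:, 1:, :].mean(dim=1)

assert torch.allclose(encoding_a, encoding_b)  # passes: no patches are dropped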

Hi,

The model internally generates a random boolean mask as seen here. This happens on every forward pass, even after model.eval(), because the noise is sampled directly with torch.rand rather than through a module whose behaviour changes between train and eval mode.
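
For context, the masking boils down to roughly the following (a simplified paraphrase of the random_masking logic in modeling_vit_mae.py, not the exact implementation):

import torch

def random_masking_sketch(sequence, mask_ratio=0.75, noise=None):
    # Sample per-patch noise, argsort it, and keep the patches with the
    # smallest noise values. There is no self.training check here, which
    # is why model.eval() does not turn the masking off.
    batch_size, seq_length, dim = sequence.shape
    len_keep = int(seq_length * (1 - mask_ratio))
    if noise is None:
        noise = torch.rand(batch_size, seq_length)  # fresh randomness on every call
    ids_shuffle = torch.argsort(noise, dim=1)  # ascending: low noise = kept
    ids_keep = ids_shuffle[:, :len_keep]
    return torch.gather(sequence, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, dim))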

To make it reproducible, one can provide a noise argument to the forward method (to make sure the same boolean mask is applied on every call):

import numpy as np
import torch
from transformers import ViTMAEModel

pixel_values = torch.randn([1, 3, 224, 224])

model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

num_patches = int((model.config.image_size // model.config.patch_size) ** 2)
noise = np.random.uniform(size=(1, num_patches))

encoding_a = model(pixel_values, noise=torch.from_numpy(noise)).last_hidden_state[:, 1:, :].mean(dim=1)
encoding_b = model(pixel_values, noise=torch.from_numpy(noise)).last_hidden_state[:, 1:, :].mean(dim=1)
assert torch.equal(encoding_a, encoding_b)
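
The same idea works with torch alone, by the way (numpy is not required; this is just a variant of the snippet above):

num_patches = int((model.config.image_size // model.config.patch_size) ** 2)
noise = torch.rand(1, num_patches)  # created once, reused for every call

encoding_a = model(pixel_values, noise=noise).last_hidden_state[:, 1:, :].mean(dim=1)
encoding_b = model(pixel_values, noise=noise).last_hidden_state[:, 1:, :].mean(dim=1)
assert torch.equal(encoding_a, encoding_b)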

Thanks for your reply.

But I found that we just need to reload the model with ‘mask_ratio = 0.0’ (which lets the model see all patches); then we get reproducible encodings from the same image.
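
A quick way to confirm that ‘mask_ratio = 0.0’ really keeps every patch is to compare sequence lengths (the numbers below assume the base checkpoint: 224×224 input and 16×16 patches, i.e. 196 patches plus a CLS token):

import torch
from transformers import ViTMAEModel

pixel_values = torch.randn([1, 3, 224, 224])

default_model = ViTMAEModel.from_pretrained("facebook/vit-mae-base").eval()
full_model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0).eval()

with torch.no_grad():
    print(default_model(pixel_values).last_hidden_state.shape)  # (1, 50, 768): CLS + 49 of 196 patches
    print(full_model(pixel_values).last_hidden_state.shape)     # (1, 197, 768): CLS + all 196 patches

Note that the kept patch tokens may still come out in a shuffled order internally, but the mean over the patch dimension used above does not depend on that order.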