I was trying to use masked image modeling in Hugging Face Transformers and I saw ViTForMaskedImageModeling in the documentation, but I did not understand how it reconstructs the original image in this line from the docs example:

loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
Also, it doesn't reconstruct the original image correctly; it just gives me noise. Here is my code:
import requests
import numpy as np
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import AutoImageProcessor, ViTForMaskedImageModeling

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")
num_patches = (model.config.image_size // model.config.patch_size) ** 2
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
# create random boolean mask of shape (batch_size, num_patches)
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
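# sanity check: per the docs, the reconstruction has the same shape as the
# input pixel_values, i.e. (1, 3, 224, 224) for this checkpoint
print(loss.item(), reconstructed_pixel_values.shape)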
# move to NumPy and reorder (channels, height, width) -> (height, width, channels) for matplotlib
reconstructed_pixel_values = reconstructed_pixel_values.detach().numpy()
reconstructed_pixel_values = np.transpose(reconstructed_pixel_values[0], (1, 2, 0))
plt.imshow(reconstructed_pixel_values)
plt.show()
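For what it's worth, I also tried undoing the processor's normalization before plotting, since pixel_values are standardized by the image processor. This is just a quick sketch assuming the processor exposes image_mean and image_std (ViTImageProcessor does), and the result still looks like noise:

# undo the normalization applied by the image processor: pixel = x * std + mean
mean = np.array(image_processor.image_mean)
std = np.array(image_processor.image_std)
unnormalized = np.clip(reconstructed_pixel_values * std + mean, 0.0, 1.0)
plt.imshow(unnormalized)
plt.show()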