How to properly train BEiT for Masked Image Modeling

I’m trying to pretrain BEiT on some images I have, but I’m not sure how to pass the masked images and their originals to the model. As I understand it, pixel_values is where I feed the original image, and bool_masked_pos is where I indicate which patches of the image were masked. So I started from the example code just to see whether it runs.

from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
from PIL import Image
import requests
import numpy as np
import torch

url = ""
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")

# function to make the mask array. I'll mask 10 of the 14*14 = 196 patches
# (224/16 = 14 patches per side) in the image
def make_masks(image_shape=224, patch_shape=16, number_of_masked_patches=10):
    num_patches = (image_shape // patch_shape) ** 2
    mask = np.zeros(num_patches, dtype=bool)
    masked_pos = np.random.choice(num_patches, number_of_masked_patches, replace=False)
    mask[masked_pos] = True
    return mask

masks = make_masks() # added this

inputs = feature_extractor(images=image, return_tensors="pt")
inputs["bool_masked_pos"] = torch.BoolTensor(masks.reshape(1, -1)) # added this
outputs = model(**inputs)
logits = outputs.logits

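For reference, the same mask can also be built directly in torch, which avoids the numpy round-trip (a sketch under the same 224/16 geometry, masking 10 patches):

```python
import torch

# same geometry as above: 224 / 16 = 14 patches per side, so 14 * 14 = 196 patches
num_patches = (224 // 16) ** 2
num_masked = 10

mask = torch.zeros(num_patches, dtype=torch.bool)
idx = torch.randperm(num_patches)[:num_masked]  # 10 distinct patch indices
mask[idx] = True
bool_masked_pos = mask.unsqueeze(0)  # shape (1, 196), batch dimension first
```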
The code runs, but the output has no loss, only the logits. Will this work when I train the model, or do I have to pass the inputs and masks in another way so that I get an actual loss to backpropagate?
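From the BEiT paper, my understanding is that the pretraining target for each masked patch is a visual-token id from the DALL-E tokenizer (a vocabulary of 8192 visual tokens), so my guess is that a loss only appears once those token ids are passed via the model's labels argument. This is a sketch of what I think that call would look like; the labels shape here is my assumption (I have not verified what the modeling code expects), so please correct me if it's wrong:

```python
import torch

num_patches = (224 // 16) ** 2  # 196 patches, as above
num_masked = 10
vocab_size = 8192               # visual-token vocabulary size from the BEiT paper

# assumption: one visual-token id (from the DALL-E tokenizer) per masked patch;
# real labels would come from tokenizing the original image, not torch.randint
labels = torch.randint(0, vocab_size, (num_masked,))

# hypothetical call -- labels shape unverified:
# outputs = model(pixel_values=inputs["pixel_values"],
#                 bool_masked_pos=inputs["bool_masked_pos"],
#                 labels=labels)
# outputs.loss should then be a scalar I can backpropagate
```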