I’m trying to pretrain BEiT on some images I have, but I’m not sure how to pass the masked images and their originals to the model. As I understand it, pixel_values is where I feed the original image, and bool_masked_pos is where I indicate which patches of the image were masked. So I ran the example code just to see if it works.
from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
from PIL import Image
import requests
import numpy as np
import torch
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
# function to make the mask arrays. I'll mask 10 of the 14*14 = 196 patches (224/16 = 14 patches per side)
def make_masks(image_shape=224, patch_shape=16, number_of_masked_patches=10):
    num_patches = (image_shape // patch_shape) ** 2  # integer arithmetic, so indices are exact
    mask = np.zeros(num_patches, dtype=np.int64)
    masked_pos = np.random.choice(num_patches, number_of_masked_patches, replace=False)
    mask[masked_pos] = 1
    return mask
masks = make_masks() # added this
inputs = feature_extractor(images=image, return_tensors="pt")
inputs["bool_masked_pos"] = torch.from_numpy(masks.reshape(1, -1)).bool()  # added this
outputs = model(**inputs)
logits = outputs.logits
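For reference, here is a quick sanity check I did on the mask helper by itself (standalone copy of the function above, using the 224/16 defaults), confirming it produces a 196-entry array with exactly 10 masked positions:

```python
import numpy as np

def make_masks(image_shape=224, patch_shape=16, number_of_masked_patches=10):
    # same helper as above: one entry per patch, 1 = masked
    num_patches = (image_shape // patch_shape) ** 2  # 14 * 14 = 196
    mask = np.zeros(num_patches, dtype=np.int64)
    masked_pos = np.random.choice(num_patches, number_of_masked_patches, replace=False)
    mask[masked_pos] = 1
    return mask

mask = make_masks()
print(mask.shape, mask.sum())  # (196,) 10
```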
The code runs, but the output contains no loss, only the logits. Will this work when I train the model, or do I have to pass the inputs and masks differently so that I get an actual loss to backpropagate?
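For what it’s worth, my current reading of BeitForMaskedImageModeling is that a loss only appears when labels (the visual token ids for each patch) are passed alongside bool_masked_pos, and that the loss is cross-entropy restricted to the masked positions. This is a sketch of that computation with random stand-in tensors, not the real model, and the 8192 codebook size is my assumption from the BEiT dVAE setup:

```python
import torch
import torch.nn.functional as F

batch, num_patches, vocab_size = 1, 196, 8192  # 14*14 patches; 8192 = assumed dVAE codebook size
logits = torch.randn(batch, num_patches, vocab_size)         # stand-in for the model's logits
labels = torch.randint(0, vocab_size, (batch, num_patches))  # stand-in for visual token ids
bool_masked_pos = torch.zeros(batch, num_patches, dtype=torch.bool)
bool_masked_pos[0, torch.randperm(num_patches)[:10]] = True  # mask 10 patches

# cross-entropy over the masked positions only: (10, 8192) logits vs (10,) targets
loss = F.cross_entropy(logits[bool_masked_pos], labels[bool_masked_pos])
```

If that reading is right, the question reduces to how to produce the per-patch token ids to use as labels.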