I’m trying to pretrain BEiT on some images I have, but I’m not sure how to pass the masked images and their originals to the model. As I understand it,
`pixel_values` is where I feed the original image, and
`bool_masked_pos` is where I say which patches of the image were masked. So I used the example code just to see if it works.
```python
from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
from PIL import Image
import requests
import numpy as np
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")

# function to make the mask array. I'll mask 10 of the 14 * 14 patches (224 / 16 = 14) in the image
def make_masks(image_shape=224, patch_shape=16, number_of_masked_patches=10):
    num_patches = int((image_shape / patch_shape) ** 2)
    zeros = np.zeros(num_patches, dtype=np.int64)
    # pick distinct patch indices to mask
    masked_pos = np.random.choice(num_patches, number_of_masked_patches, replace=False)
    zeros[masked_pos] = 1
    return zeros

masks = make_masks()  # added this

inputs = feature_extractor(images=image, return_tensors="pt")
inputs["bool_masked_pos"] = torch.BoolTensor(masks.reshape(1, -1))  # added this
outputs = model(**inputs)
logits = outputs.logits
```
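For what it's worth, the mask construction itself can be sanity-checked without loading the model at all. This is a minimal sketch of the same idea using only NumPy, assuming a 224-pixel image with 16-pixel patches (so a 14 × 14 = 196 patch grid) and 10 masked patches:

```python
import numpy as np

def make_bool_mask(image_shape=224, patch_shape=16, number_of_masked_patches=10):
    # 224 / 16 = 14, so the image is split into 14 * 14 = 196 patches
    num_patches = (image_shape // patch_shape) ** 2
    mask = np.zeros(num_patches, dtype=bool)
    # choose distinct patch indices to mask
    masked_pos = np.random.choice(num_patches, number_of_masked_patches, replace=False)
    mask[masked_pos] = True
    return mask

mask = make_bool_mask()
print(mask.shape, mask.sum())  # (196,) 10
```

A boolean dtype avoids the float-indexing and casting workarounds, and the array can be reshaped to `(1, 196)` before being handed to the model as `bool_masked_pos`.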
The code runs, but I don’t get a loss as output, only the logits. Will this work when I train the model, or do I have to pass the inputs and masks in another way (so that I get an actual loss and can therefore train the model)?