VideoMAE Pretrain Batch Masking

I have a question about whether it’s possible to use multiple masks in a single micro-batch of size >1 for VideoMAE pre-training.

In the current code, it seems that when the inverse mask ([B, P], where P is the number of patches) is applied to the batched embeddings ([B, P, D], where D is the number of features), the batch dimension is lost ([P, D]: the visible patches of all batch entries are concatenated together). When the number of visible patches differs between batch entries, the subsequent reshape fails with:

RuntimeError: shape '[B, -1, D]' is invalid for input of size X
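
For illustration, here is a minimal sketch with toy shapes (not the library code, but roughly the indexing pattern I mean) showing how the boolean indexing drops the batch dimension and the reshape then fails:

import torch

B, P, D = 2, 8, 4
embeddings = torch.randn(B, P, D)

# two masks with a different number of visible (False) positions per batch entry
bool_masked_pos = torch.zeros(B, P, dtype=torch.bool)
bool_masked_pos[0, :6] = True  # entry 0: 6 masked, 2 visible
bool_masked_pos[1, :5] = True  # entry 1: 5 masked, 3 visible

visible = embeddings[~bool_masked_pos]  # shape [5, D]: the batch dimension is gone
visible.reshape(B, -1, D)  # RuntimeError: shape '[2, -1, 4]' is invalid for input of size 20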

Or are we supposed to use the same mask for the micro-batch to make sure dimensions are the same?

Thank you!

Thanks for reporting this, I will take a look.

I’ve taken a look, this runs fine for me:

from transformers import VideoMAEForPreTraining
import torch

model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

# assume the batch size is 2
num_frames = 16
pixel_values = torch.randn(2, num_frames, 3, 224, 224)

num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame

# one random boolean mask per video (True = masked)
bool_masked_pos_1 = torch.randint(0, 2, (1, seq_length)).bool()
bool_masked_pos_2 = torch.randint(0, 2, (1, seq_length)).bool()
bool_masked_pos = torch.cat([bool_masked_pos_1, bool_masked_pos_2])

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)

loss = outputs.loss

Thank you for the code! I ran it in my environment and it still says:

RuntimeError: shape '[2, -1, 768]' is invalid for input of size 1209600

PyTorch version 1.9.0 and transformers version 4.22.1.

Hi, as far as I understand, this only works when the two random masks happen to end up with the same number of masked positions (i.e. the same masked/unmasked ratio), which is why it may or may not work purely by chance.
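
To make that concrete, here is a quick check using the seq_length from the snippet above (1568 for 16 frames at 224x224 with patch size 16 and tubelet size 2): the two random masks almost never contain the same number of True entries, so the internal reshape to [batch_size, -1, hidden_size] only succeeds by coincidence.

import torch

seq_length = 1568  # (16 // 2) * (224 // 16) ** 2

bool_masked_pos_1 = torch.randint(0, 2, (1, seq_length)).bool()
bool_masked_pos_2 = torch.randint(0, 2, (1, seq_length)).bool()

# number of masked positions per mask; they almost always differ
print(bool_masked_pos_1.sum().item(), bool_masked_pos_2.sum().item())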

I was able to make this work for batches using this snippet:

import math
import numpy as np

# e.g. mask_ratio = 0.9 (fraction of patches to mask); seq_length as computed above
bool_masked_pos = np.zeros(seq_length)
mask_num = math.ceil(seq_length * mask_ratio)
mask = np.random.choice(seq_length, mask_num, replace=False)
bool_masked_pos[mask] = 1

Basically, this makes sure the same number of patches is masked for every video in the batch.

Oh yes that’s indeed required. The mask ratio needs to be the same for each video in a batch.

Note that the VideoMAE authors trained the model like that. The code snippet is also used in the original implementation.

Thank you for the solution @aljipa! I guess bool_masked_pos needs to be cast to boolean, and "batch size" copies of bool_masked_pos need to be stacked, i.e., following your code snippet:

batch_size = 2

# same recipe as above (seq_length and mask_ratio as defined earlier)
bool_masked_pos = np.zeros(seq_length)
mask_num = math.ceil(seq_length * mask_ratio)
mask = np.random.choice(seq_length, mask_num, replace=False)
bool_masked_pos[mask] = 1

# torch and bool cast, extra dimension added for concatenation
bool_masked_pos = torch.as_tensor(bool_masked_pos).bool().unsqueeze(0)
bool_masked_pos = torch.cat([bool_masked_pos for _ in range(batch_size)])

So that the mask is the same for all the elements within the batch, as mentioned by @nielsr.

No that’s not necessary, I said that the mask ratio needs to be the same (like 0.9), not necessarily the mask itself. The boolean mask can differ between examples in a batch.