I have a question about whether it’s possible to use multiple masks in a single micro-batch of size >1 for VideoMAE pre-training.
In the current code, it seems that when the inverse mask ([B, P], where P is the number of patches) is applied to the batched embeddings ([B, P, D], where D is the number of features), we lose the batch dimension ([P, D]: the visible patches of all batch entries are clumped together), and we get the following error when the number of visible patches differs between batch entries, so the result cannot be reshaped:
{RuntimeError}shape '[B, -1, D]' is invalid for input of size X
Or are we supposed to use the same mask for the micro-batch to make sure dimensions are the same?
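To make the failure concrete, here is a minimal sketch with toy shapes (the boolean-indexing-plus-reshape pattern is my assumption of what the embedding code does): when examples have a different number of visible patches, boolean indexing flattens the batch dimension and the reshape back to [B, -1, D] fails.

import torch

B, P, D = 2, 8, 4
embeddings = torch.randn(B, P, D)

# Two masks with a different number of masked patches per example
bool_masked_pos = torch.zeros(B, P, dtype=torch.bool)
bool_masked_pos[0, :5] = True   # 5 masked -> 3 visible
bool_masked_pos[1, :6] = True   # 6 masked -> 2 visible

# Boolean indexing drops the batch dimension: all visible patches land in one [5, 4] tensor
visible = embeddings[~bool_masked_pos]

# 5 visible patches cannot be split evenly over 2 examples, so this raises
# RuntimeError: shape '[2, -1, 4]' is invalid for input of size 20
visible = visible.reshape(B, -1, D)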
Hi, as far as I understand, there is a chance that two different masks end up with the same masked/unmasked ratio, which is why this may or may not work, depending on chance.
Thank you for the solution @aljipa! I guess bool_masked_pos needs to be cast to boolean, and "batch size" copies of bool_masked_pos need to be stacked, i.e., following your code snippet:
import math
import numpy as np
import torch

batch_size = 2
seq_length = 1568   # e.g. (16 // 2) * (224 // 16) ** 2 for videomae-base
mask_ratio = 0.9

bool_masked_pos = np.zeros(seq_length)           # 0 = visible, 1 = masked
mask_num = math.ceil(seq_length * mask_ratio)
mask = np.random.choice(seq_length, mask_num, replace=False)
bool_masked_pos[mask] = 1                        # randomly mask mask_ratio of the patches
# Torch and bool cast, extra dimension added for concatenation
bool_masked_pos = torch.as_tensor(bool_masked_pos).bool().unsqueeze(0)
bool_masked_pos = torch.cat([bool_masked_pos for _ in range(batch_size)])
So that the mask is the same for all the elements within the batch, as mentioned by @nielsr.
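For completeness, a sketch of how the stacked mask could then be fed to the model (default videomae-base shapes assumed, with dummy pixel values for illustration; batch_size and bool_masked_pos are taken from the snippet above):

from transformers import VideoMAEForPreTraining
import torch

model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
# dummy clip: (batch_size, num_frames, channels, height, width)
pixel_values = torch.randn(batch_size, 16, 3, 224, 224)
outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)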
No, that's not necessary. I said that the mask ratio needs to be the same (e.g. 0.9), not necessarily the mask itself. The boolean mask can differ between examples in a batch.
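In other words, something like the following sketch should work: each example gets its own random mask, but the number of masked patches (and hence the ratio) is identical across the batch (seq_length = 1568 assumes the default videomae-base configuration).

import torch

batch_size, seq_length, mask_ratio = 2, 1568, 0.9
num_masked = int(seq_length * mask_ratio)

# A different random mask per example, but the same number of masked patches in each
bool_masked_pos = torch.zeros(batch_size, seq_length, dtype=torch.bool)
for i in range(batch_size):
    idx = torch.randperm(seq_length)[:num_masked]
    bool_masked_pos[i, idx] = True

# Every example keeps the same number of visible patches,
# so the [B, P, D] -> [B, num_visible, D] reshape stays valid.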