Masking Probability


I am wondering whether the [MASK] masking of tokens is done by applying the masking probability to each sequence individually, or to the whole batch at once.

In DataCollatorForLanguageModeling the masking is done on the tensor directly.

If I read the code correctly, every token independently has the masking probability of being selected (then masked/replaced/left unchanged), regardless of how many tokens there are, i.e. the size of the tensor.

In the paper, it’s mentioned that masking is applied to 15% of all WordPiece tokens in each sequence at random.
In the code, though, it’s based on `inputs`:

labels = inputs.clone()
probability_matrix = torch.full(labels.shape, self.mlm_probability)
masked_indices = torch.bernoulli(probability_matrix).bool()

I don’t know whether `inputs` refers to one sequence or to one batch of sequences.

`inputs` is a batch, and `probability_matrix` assigns the masking probability to every token position in every sequence of that batch, so each token is selected independently.
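To make this concrete, here is a minimal sketch of the same three lines run on a toy batch (the token ids and batch shape are made up for illustration). It shows that `probability_matrix` has one entry per token position across the whole batch, so the Bernoulli draw is per token, not per sequence:

```python
import torch

mlm_probability = 0.15

# Hypothetical batch: 2 sequences of 8 token ids each.
inputs = torch.randint(5, 100, (2, 8))

labels = inputs.clone()
# One probability value for every token position in the batch.
probability_matrix = torch.full(labels.shape, mlm_probability)
# Each token is independently selected with probability 0.15.
masked_indices = torch.bernoulli(probability_matrix).bool()

print(probability_matrix.shape)  # same shape as the batch: (2, 8)
```

Because the draws are independent per position, on average 15% of tokens in each sequence end up selected, which matches the per-sequence description in the paper even though the code operates on the whole batch tensor at once.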