Hi
I am wondering whether the masking of tokens with [MASK] is done by applying the masking probability to a given sequence or to the whole batch at once.
In DataCollatorForLanguageModeling the masking is done on the tensor directly.
If I read the code correctly, every token independently has the masking probability of being selected (and then masked, replaced with a random token, or left unchanged), regardless of how many tokens you have, i.e. the size of the tensor.
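For example, here is a minimal sketch of running a couple of toy sentences through the collator (the checkpoint name and the sentences are only placeholders; mlm and mlm_probability are the collator's real arguments):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# the collator pads the examples into one batch tensor and then masks per token
examples = [tokenizer("the quick brown fox"), tokenizer("jumps over the lazy dog")]
batch = collator(examples)

print(batch["input_ids"])  # some positions replaced by tokenizer.mask_token_id or a random id
print(batch["labels"])     # original ids at the selected positions, -100 everywhere else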
In the paper, it’s mentioned that the masking is done on 15% of all WordPiece tokens in each sequence at random.
In the code, though, it’s based on the inputs in
labels = inputs.clone()
probability_matrix = torch.full(labels.shape, self.mlm_probability)
masked_indices = torch.bernoulli(probability_matrix).bool()
I don’t know whether inputs refers to one sequence or to a whole batch of sequences.
inputs is the whole batch, and probability_matrix creates a probability for every token position in each sequence of that batch.
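A quick standalone sketch of just that Bernoulli step (not the collator itself, only the same logic applied to a fake batch) shows that every token position gets its own independent draw, so it makes no difference whether you think of it per sequence or per batch:

import torch

mlm_probability = 0.15
batch_size, seq_len = 4, 10

inputs = torch.randint(1000, 30000, (batch_size, seq_len))  # fake token ids, shape [batch, seq_len]
labels = inputs.clone()

# one 0.15 entry per token position across the whole batch
probability_matrix = torch.full(labels.shape, mlm_probability)
# independent Bernoulli draw for every token, regardless of which sequence it belongs to
masked_indices = torch.bernoulli(probability_matrix).bool()

labels[~masked_indices] = -100  # loss is only computed on the selected tokens
print(masked_indices)           # roughly 15% True, scattered independently across the batch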