Masking Probability

abdallah197 · August 18, 2020, 9:19am

Hi

I am wondering whether the [MASK] tokens are chosen by applying the masking probability to each sequence separately or to the whole batch at once.

sgugger · August 18, 2020, 12:21pm

In DataCollatorForLanguageModeling the masking is done on the tensor directly.

RichardWang · August 18, 2020, 12:49pm

If I read the code correctly, every token independently has the masking probability of being selected (and is then masked, replaced, or left unchanged). This is independent of how many tokens you have, i.e. the size of the tensor.
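To make the three outcomes concrete, here is a minimal sketch of the selection step followed by the 80% / 10% / 10% split described in the BERT paper. It is not the library's exact code: the function name `mask_tokens_sketch`, the token id `4` used for [MASK], and the vocabulary size are assumptions for illustration, and the handling of special and padding tokens is left out.

```python
import torch


def mask_tokens_sketch(inputs, mask_token_id, vocab_size, mlm_probability=0.15):
    """Hypothetical helper: per-token selection, then the 80/10/10 split."""
    inputs = inputs.clone()
    labels = inputs.clone()

    # Each token is selected independently with probability mlm_probability.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~selected] = -100  # ignored by the default cross-entropy ignore_index

    # 80% of the selected tokens become the mask token.
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    inputs[replaced] = mask_token_id

    # Half of the remaining 20% (10% overall) become a random token id.
    randomized = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~replaced
    )
    random_ids = torch.randint(vocab_size, labels.shape, dtype=torch.long)
    inputs[randomized] = random_ids[randomized]

    # The last 10% of the selected tokens are left unchanged.
    return inputs, labels


torch.manual_seed(0)
batch = torch.randint(5, 1000, (2, 10))  # assumed toy batch: 2 sequences, 10 tokens
masked_inputs, labels = mask_tokens_sketch(batch, mask_token_id=4, vocab_size=1000)
print(masked_inputs)
print(labels)
```

Positions where `labels` is -100 were not selected, so their input ids stay exactly as they were.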

abdallah197 · August 19, 2020, 2:44pm

In the paper, it’s mentioned that the masking is done on 15% of all WordPiece tokens in each sequence at random.
In the code, though, it’s based on inputs:

labels = inputs.clone()
probability_matrix = torch.full(labels.shape, self.mlm_probability)
masked_indices = torch.bernoulli(probability_matrix).bool()

I don’t know whether inputs refers to one sequence or to one batch of sequences.

valhalla · August 20, 2020, 3:28pm

inputs is a batch, and probability_matrix holds a masking probability for every token of every sequence in that batch.
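A quick way to check this is to run the three lines from the snippet on a toy batch and look at the shapes. This is only a sketch: the batch contents, its size (4 sequences of 12 tokens), and the fixed seed are made up for illustration, and `mlm_probability` is set to the 0.15 from the paper.

```python
import torch

torch.manual_seed(0)

mlm_probability = 0.15  # assumed value, matching the paper
# Hypothetical batch of token ids: 4 sequences, 12 tokens each.
inputs = torch.randint(low=5, high=1000, size=(4, 12))

labels = inputs.clone()
probability_matrix = torch.full(labels.shape, mlm_probability)
masked_indices = torch.bernoulli(probability_matrix).bool()

# One probability (and one independent draw) per token of the batch.
print(probability_matrix.shape)
# Number of selected tokens in each sequence; this can differ per row.
print(masked_indices.sum(dim=1))
```

Because every position gets its own Bernoulli draw, 15% is the expected fraction of selected tokens. An individual sequence, or the batch as a whole, can end up with more or fewer.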
