Hi

I am wondering whether the masking of tokens [MASK] is done by applying the masking probability to a given sequence or the whole batch altogether.

If I read the code correctly, every "token" position is masked with that probability independently (masked/replaced/left unchanged), regardless of how many tokens you have, i.e. the size of the tensor.

In the paper, it's mentioned that the masking is done on 15% of all WordPiece tokens in each sequence at random.

In the code though, it's based on the **inputs** in

```
labels = inputs.clone()
# probability_matrix has the same shape as the inputs tensor,
# so torch.bernoulli draws one independent sample per token position
probability_matrix = torch.full(labels.shape, self.mlm_probability)
masked_indices = torch.bernoulli(probability_matrix).bool()
```

I don't know whether `inputs` refers to one sequence or one batch of sequences.

`inputs` is a batch, and `probability_matrix` creates a probability for every token in every sequence of that batch.
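To see this concretely, here is a minimal self-contained sketch of the masking step. The toy `inputs` tensor, the seed, and the `mlm_probability` value are my own for illustration; the three masking lines mirror the snippet quoted above:

```
import torch

torch.manual_seed(0)

# a toy batch: 2 sequences of 8 arbitrary token ids
inputs = torch.randint(1000, (2, 8))

mlm_probability = 0.15
labels = inputs.clone()
# one probability entry per token position in the whole batch
probability_matrix = torch.full(labels.shape, mlm_probability)
# independent Bernoulli draw for each token position
masked_indices = torch.bernoulli(probability_matrix).bool()

# same shape as the batch: the draw is per token, not per sequence
print(masked_indices.shape)
```

Since `probability_matrix` is built from `labels.shape`, it inherits the full batch shape, and `torch.bernoulli` samples each entry independently, so masking at roughly 15% happens token by token across the whole batch at once.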