Where in the code does masking of tokens happen when pretraining BERT

Hi all,

I was making myself familiar with the BertForPreTraining and BertTokenizer classes, and I am unsure where in the code the masking of tokens actually happens. I have tried tracing through but am getting lost in the weeds of various tokenizer subclasses. Any help with this would be much appreciated.


There is no script to pretrain BERT in the examples; transformers is primarily there to help you fine-tune a model like BERT on a downstream task.

That being said, the DataCollatorForLanguageModeling masks random tokens when creating batches, if you need it.
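For instance, you can let the collator do the masking for you when it builds a batch. A minimal sketch (assuming bert-base-uncased and a reasonably recent version of transformers):

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # masked language modeling
    mlm_probability=0.15,  # fraction of tokens selected for masking
)

encoding = tokenizer("The quick brown fox jumps over the lazy dog.")
batch = collator([encoding])

# input_ids now contain the [MASK] id at the randomly chosen positions;
# labels hold the original ids at those positions and -100 everywhere else.
print(batch["input_ids"])
print(batch["labels"])
```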

In that case, I notice that the docs for BertForPreTraining describe the labels I am supposed to pass in as follows:

labels (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None) – Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

This implies that I somehow need to know which tokens were masked, but I am currently unclear on how to get this information: I am not sure whether the tokenizer masks tokens or whether I am expected to do so myself, either before or after using the tokenizer. In that sense, I suppose a more specific question would be this:
Does the BertTokenizer class have the capability to prepare masked inputs for the BertForPreTraining and BertForMaskedLM classes? If so, where in the code does this occur? (While tracing through the code, I have seen that it takes a mask_token parameter, presumably to associate an input id with it, but I have not encountered any code that actually replaces tokens with the mask token.) If not, how could I go about masking tokens myself?

You’re mixing up two things: setting labels to -100 so they are ignored in the loss computation, and the actual masking of tokens in the input. The loss ignores tokens with label -100 because that is the default ignore_index of PyTorch’s losses. You can use it to ignore the results of padded tokens.
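Concretely, this is just the default ignore_index of PyTorch’s cross-entropy loss at work (a tiny standalone example, unrelated to BERT itself):

```python
import torch
import torch.nn as nn

# nn.CrossEntropyLoss ignores targets equal to -100 by default (ignore_index=-100),
# so positions labelled -100 contribute nothing to the loss.
loss_fct = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)                 # 4 token positions, vocabulary of size 10
targets = torch.tensor([3, -100, 7, -100])  # only positions 0 and 2 are scored
print(loss_fct(logits, targets))
```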

The masked tokens are included in the loss computation (otherwise your model wouldn’t learn to predict them). The mask token is [MASK] for BERT; you can replace any token you like with it.
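If you want to do it by hand, here is a minimal sketch of masking one token and building the matching labels (the sentence, the position and the use of BertForMaskedLM are just for illustration, assuming a recent version of transformers):

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
labels = torch.full_like(inputs["input_ids"], -100)  # ignore every position by default

mask_position = 6  # position of "paris" in this particular encoding (illustrative)
labels[0, mask_position] = inputs["input_ids"][0, mask_position]  # keep the true id here
inputs["input_ids"][0, mask_position] = tokenizer.mask_token_id   # replace it with [MASK]

outputs = model(**inputs, labels=labels)
print(outputs.loss)  # loss is computed only at the masked position
```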

@sgugger Thanks for your prompt responses and for bearing with me.

I understand that labels with index -100 determine which tokens are included in the loss computation, and that this can be used to avoid learning on padded tokens. However, my understanding is that in the masked language modelling task BERT only trains on masked tokens, i.e. tokens replaced with [MASK], in which case I would also want the model to ignore the results from any non-masked tokens, which is why I need to know which tokens were masked.

This leads back to my main point of confusion: is BertTokenizer replacing tokens with [MASK] as described in the BERT paper? I see that it has a mask_token parameter; what does it use this parameter for?
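For reference, this is what I can see the tokenizer exposing when I poke at it in the interpreter (using bert-base-uncased):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.mask_token)     # '[MASK]'
print(tokenizer.mask_token_id)  # 103 for bert-base-uncased
# Encoding a sentence does not seem to introduce any [MASK] ids by itself:
print(tokenizer("The quick brown fox").input_ids)
```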

@leoapolonio pointed me to this snippet of code, which might be helpful. What I understood is that the mask_tokens function in the DataCollatorForLanguageModeling class is responsible for randomly masking the tokens.
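Roughly, that function implements the 80/10/10 masking strategy from the BERT paper. Here is a simplified sketch of the logic, close in spirit to the library code but not a verbatim copy:

```python
import torch

def mask_tokens_sketch(input_ids, tokenizer, mlm_probability=0.15):
    """Simplified sketch of MLM masking: select ~15% of tokens; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Labels keep the original ids at selected positions and are -100 elsewhere."""
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)

    # Never mask special tokens such as [CLS] and [SEP].
    special_tokens_mask = torch.tensor(
        [tokenizer.get_special_tokens_mask(seq, already_has_special_tokens=True)
         for seq in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)

    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # loss is only computed on the selected tokens

    # 80% of the selected tokens are replaced with [MASK].
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = tokenizer.mask_token_id

    # 10% are replaced with a random token (half of the remaining 20%).
    indices_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    )
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[indices_random] = random_words[indices_random]

    # The remaining 10% keep their original token.
    return input_ids, labels
```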
