Where in the code does masking of tokens happen when pretraining BERT

rsvarma · August 11, 2020, 2:19am

Hi all,

I was making myself familiar with the BertForPreTraining and BertTokenizer classes, and I am unsure where in the code the masking of tokens actually happens. I have tried tracing through but am getting lost in the weeds of various tokenizer subclasses. Any help with this would be much appreciated.

sgugger · August 11, 2020, 2:53am

There is no script to pretrain BERT in the examples; transformers is primarily there to help you fine-tune a model like BERT on a downstream task.

That being said, if you need it, DataCollatorForLanguageModeling masks random tokens when creating batches.

rsvarma · August 11, 2020, 3:12am

In that case, I notice that in the docs for BertForPreTraining the labels that I am supposed to pass in are the following:

labels ( torch.LongTensor of shape (batch_size, sequence_length) , optional, defaults to None ) – Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]

This implies that I somehow need to know which tokens were masked, but I am unclear on how to get this information: I am not sure whether the tokenizer masks tokens or whether I am expected to do so myself, either before or after using the tokenizer. In that sense, I suppose a more specific question would be this:

Does the BertTokenizer class have the capability to prepare masked inputs for the BertForPreTraining and BertForMaskedLM classes? If so, where in the code does this occur? While tracing through the code, I have seen that it takes a mask_token parameter, presumably to associate an input id with it, but I have not encountered any code that actually replaces tokens with mask tokens. If not, how could I go about masking tokens myself?

sgugger · August 11, 2020, 3:21am

You’re mixing up two things: ignoring tokens in the loss computation by setting their labels to -100, and masking tokens in the input. The loss ignores positions whose label is -100 because that is the default ignore_index of PyTorch’s losses. You can use it to ignore the results of padded tokens.

The masked tokens are included in the loss computation (otherwise your model isn’t learning to predict them). The mask token is [MASK] for BERT; you can replace any token you like with it.
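
To make the -100 convention concrete, here is a minimal pure-Python sketch of a cross-entropy that skips positions labeled -100. It is not the PyTorch implementation; the function name masked_cross_entropy and the toy logits are made up for illustration, and returning 0.0 when every position is ignored is a choice of this sketch only.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's default ignore_index)


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocabulary id).
    labels: list of target ids, or ignore_index for positions to skip.
    """
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # Numerically stable log-sum-exp over the scores at this position.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]  # negative log-probability of the target
        counted += 1
    # Assumption of this sketch: an all-ignored input yields 0.0.
    return total / counted if counted else 0.0


# Toy example with a 3-token vocabulary; the middle position is ignored.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 5.0], [1.0, 1.0, 1.0]]
labels = [0, IGNORE_INDEX, 2]
loss = masked_cross_entropy(logits, labels)
```

Whatever the model predicts at the ignored position has no effect on the result, which is why -100 works both for padding and for any other position you do not want to train on.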

rsvarma · August 11, 2020, 4:43am

@sgugger Thanks for your prompt responses and for bearing with me.

I understand that the tokens with indices -100 are used to determine which tokens are used in the computation of loss, as well as that they are used to prevent you from learning on padded tokens. However, my understanding is that in the masked language modelling task BERT only trains on masked tokens, e.g. tokens replaced with [MASK], in which case I would also want the model to ignore results from any non-masked tokens, which is why I would need to know which tokens are masked.

This leads back to my main point of confusion now: Is BertTokenizer replacing tokens with [MASK] as described in the BERT paper? I see that it has a mask_token parameter; what does it use this parameter for?

abdallah197 · August 17, 2020, 9:17pm

@leoapolonio guided me to this snippet of code, which might be helpful. From what I understood, the mask_tokens function in the DataCollatorForLanguageModeling class is responsible for randomly masking the tokens.

github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py#L157

    if are_tensors_same_length:
        return torch.stack(examples, dim=0)
    else:
        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )
        return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)

def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
    """

    if self.tokenizer.mask_token is None:
        raise ValueError(
            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
        )

    labels = inputs.clone()
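
The excerpt above is cut off right after the labels are cloned. As a rough illustration of what its docstring describes (80% MASK, 10% random, 10% original), here is a pure-Python sketch that works on a plain list of ids. It is not the library's code: the signature, the rng argument, and the example ids are assumptions for illustration, and the 0.15 selection rate is assumed from the BERT paper. Setting the label to -100 everywhere except the selected positions is what restricts the loss to the masked tokens.

```python
import random

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids.

    Hypothetical helper for illustration; not the transformers API.
    """
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(input_ids):
        # Never select special tokens; select others with mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token id
        # remaining 10%: leave the original token in place
    return inputs, labels


# Illustrative ids only (assumed: 101 = [CLS], 102 = [SEP], 103 = [MASK]).
ids = [101, 7592, 2088, 2003, 2307, 102]
masked_ids, labels = mask_tokens(
    ids, mask_id=103, vocab_size=30522, special_ids={101, 102},
    rng=random.Random(0),
)
```

The two returned lists can then be batched and passed as input_ids and labels, so the tokenizer itself never has to replace anything.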
