Mask only specific words

What would be the best strategy to mask only specific words during LM training?
My aim is to mask only the words of interest, which I have previously collected in a list.

The issue arises because the tokenizer not only splits a single word into multiple tokens, but also adds a special character if the word does not occur at the beginning of a sentence.

The word “Valkyria”:

  • at the beginning of a sentence gets split as [‘V’, ‘alky’, ‘ria’] with corresponding IDs: [846, 44068, 6374],
  • while in the middle of a sentence it gets split as [‘ĠV’, ‘alky’, ‘ria’] with corresponding IDs: [468, 44068, 6374].

This is just one of the issues forcing me to have multiple entries in my list of to-be-filtered IDs.

I have already had a look at the mask_tokens() function in the DataCollatorForLanguageModeling class, which is the function that actually masks the tokens for each batch, but I cannot find an efficient and smart way to mask only specific words and their corresponding IDs.

When using the “fast” variant of the tokenizers available in huggingface/transformers, whenever you encode some text, you get back a BatchEncoding.
This BatchEncoding provides some helpful mappings that we can use in this kind of situation. So, you should be able to:

  1. Find the word associated with any token using token_to_word. This method returns the index of the word in the input sequence.
  2. Once you know the word’s index, you can actually retrieve its span with word_to_chars. This will let you extract the word from the input sequence.
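The two steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the helper name find_target_tokens is made up, and the offsets and word indices are written by hand so the snippet runs without downloading a model. In real use they would come from a fast tokenizer, e.g. the "offset_mapping" entry returned when calling the tokenizer with return_offsets_mapping=True, and encoding.word_ids().

```python
def find_target_tokens(text, offsets, word_ids, target_words):
    """Return one boolean per token: True if the token belongs to a target word.

    offsets  : list of (start, end) character spans, one per token
    word_ids : list of word indices, one per token (None for special tokens)
    """
    # Merge the character spans of all tokens that share a word index.
    spans = {}
    for (start, end), wid in zip(offsets, word_ids):
        if wid is None:  # special tokens are not part of any word
            continue
        if wid in spans:
            s, e = spans[wid]
            spans[wid] = (min(s, start), max(e, end))
        else:
            spans[wid] = (start, end)

    # Compare the recovered surface form of each word against the list.
    # strip() drops a leading space that some tokenizers include in the span.
    targets = {w.lower() for w in target_words}
    hits = {wid for wid, (s, e) in spans.items()
            if text[s:e].strip().lower() in targets}
    return [wid in hits for wid in word_ids]


# Hand-written stand-in for tokenizer output (assumed, not real output):
# <s> | I | Ġplayed | ĠV | alky | ria | Ġyesterday | </s>
text = "I played Valkyria yesterday"
offsets = [(0, 0), (0, 1), (2, 8), (9, 10), (10, 14), (14, 17), (18, 27), (0, 0)]
word_ids = [None, 0, 1, 2, 2, 2, 3, None]

flags = find_target_tokens(text, offsets, word_ids, ["Valkyria"])
print(flags)
```

Because the match is done on the word's text rather than on token IDs, the list of words does not need separate entries for the ‘V’ and ‘ĠV’ variants.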

Hi @Anthony,

Thank you for your prompt reply!
The approach you proposed, if I’m not mistaken, would be helpful in permanently masking words when reading the dataset.

Instead, I am interested in dynamically masking words at batch time.
To be clearer, I would like to implement (efficiently) a mask_tokens() function (like the one defined in the DataCollatorForLanguageModeling class) which masks only the IDs corresponding to words provided in a specific list.
These IDs would not be masked in every batch, but according to some stochastic strategy, unlike the approach above, which would mask them once when the dataset is read at the beginning.

I am wondering whether there is an efficient and neat way to do so, exploiting the Huggingface functions and also taking into account that some words get split into multiple IDs depending on the tokenizer at hand.
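One possible shape for such a function is sketched below, assuming each example carries, next to its input IDs, a word index per token and a boolean per token saying whether it belongs to a listed word. The function name, the mask_prob parameter, and the placeholder IDs are all assumptions made for illustration; only [468, 44068, 6374] comes from the example above. The sketch always substitutes the mask token, without the random-replacement and keep-unchanged cases of the stock mask_tokens(), and uses -100 as the label for positions that should be ignored by the loss.

```python
import random


def mask_target_words(input_ids, word_ids, is_target, mask_token_id,
                      mask_prob=0.5, rng=random):
    """Mask every token of a listed word together, each word with
    probability mask_prob. Returns (masked_ids, labels)."""
    # Words eligible for masking: those with at least one flagged token.
    candidates = sorted({w for w, t in zip(word_ids, is_target)
                         if t and w is not None})
    # Draw once per word, so a word is masked as a whole or not at all.
    chosen = {w for w in candidates if rng.random() < mask_prob}

    masked, labels = [], []
    for tok, w in zip(input_ids, word_ids):
        if w in chosen:
            masked.append(mask_token_id)
            labels.append(tok)    # the model must predict the original ID
        else:
            masked.append(tok)
            labels.append(-100)   # ignored by the loss
    return masked, labels


MASK_ID = 50264  # placeholder; use tokenizer.mask_token_id in practice
input_ids = [0, 100, 200, 468, 44068, 6374, 300, 2]  # mostly made-up IDs
word_ids = [None, 0, 1, 2, 2, 2, 3, None]
is_target = [False, False, False, True, True, True, False, False]

masked, labels = mask_target_words(input_ids, word_ids, is_target,
                                   MASK_ID, mask_prob=1.0)
print(masked)
print(labels)
```

Since the draw happens each time the function is called, calling it from a collator would give a different mask per batch, which is the dynamic behaviour described above.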