Mask only specific words

Gabrer · July 10, 2020, 8:47am

What would be the best strategy to mask only specific words during the LM training?
My aim is to mask only words of interest which I have previously collected in a list.

The issue arises since the tokenizer, not only splits a single word in multiple tokens, but it also adds special characters if the word does not occur at the begging of a sentence.
E.g.:

The word “Valkyria”:

at the beginning of a sentences gets split as [‘V’, ‘alky’, ‘ria’] with corresponding IDs: [846, 44068, 6374].

while in the middle of a sentence as [‘ĠV’, ‘alky’, ‘ria’] with corresponding IDs: [468, 44068, 6374],

This is just one of the issues forcing me to have multiple entries in my list of to-be-filtered IDs.

I have already had a look at the mask_tokens() function into the DataCollatorForLanguageModeling class, which is the function actually masking the tokens during each batch, but I cannot find any efficient and smart way to mask only specific words and their corresponding IDs.

anthony · July 10, 2020, 3:38pm

When using the “fast” variant of the tokenizers available in huggingface/transformers, whenever you encode some text, you get back a BatchEncoding.
This BatchEncoding provides some helpful mappings that we can use in this kind of situation. So, you should be able to:

Find the word associated with any token using token_to_word. This method returns the index of the word in the input sequence.
Once you know the word’s index, you can actually retrieve its span with word_to_chars. This will let you extract the word from the input sequence.

Gabrer · July 11, 2020, 10:31am

Hi @Anthony,

Thank you for your prompt reply!
The approach you proposed, if I’m not mistaken, would be helpful in permanently masking words when reading the dataset.

Instead, I am interested in dynamically masking words at batch time.
To be more clear, I would like to implement (efficiently) a mask_tokens() function (as the one defined in the DataCollatorForLanguageModeling class) which masks only IDs corresponding to words provided in a specific list.
These IDs would not be masked during each batch but following some stochastic strategy, unlike the approach which would mask them when the dataset is read at the beginning.

I am wondering whether there is an efficient and neat way to do so, exploiting the Huggingface functions and taking also into account that some words get split in multiple IDs depending on the tokenizer at hand.

RylanSchaeffer · November 6, 2021, 7:30pm

@Gabrer did you find a solution?

BramVanroy · November 7, 2021, 12:30pm

You can create your own custom mask and merge that with the automatic mask.

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
word = "Valkyria"
sentences = ["Valkyria at the beginning of sentence", "And here Valkyria in the middle of one."]
# Index of word in the sentences (word-tokenized!)
word_idxs_in_sent = [sent.split(" ").index(word) for sent in sentences]

encoded = tokenizer(sentences, return_tensors="pt", padding=True)
print("Original mask", encoded["attention_mask"])

# For each sentence, set a subword token to False if it belongs to the word (becomes 0 in LongTensor)
match_idxs = torch.LongTensor([[wid != word_idxs_in_sent[batch_idx] for wid in encoded.word_ids(batch_idx)]
              for batch_idx in range(len(sentences))])
print("Subword indices of matching word", match_idxs)

# Merge: if a word is zero in our custom match, merge, if not, use the original mask
# This ensures that we mask the word IDs but keep the original mask for special tokens (cls, pad, etc.)
merged = torch.where(match_idxs == 0, match_idxs, encoded["attention_mask"])
print("Merged mask", merged)

Results:

original mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Subword indices of matching word tensor([[1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
Merged mask tensor([[1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])

Topic		Replies	Views
Masking specific token in each input sentence during Masked language modelling 🤗Transformers	0	910	October 18, 2021
Mask modelling on specific words Beginners	1	930	March 25, 2021
Whole-word masking for T5 Beginners	2	233	November 28, 2023
Selective masking in Language modeling Beginners	1	1892	August 13, 2020
Dynamic decoder token masking 🤗Transformers	0	178	February 13, 2023

Mask only specific words

Related Topics