Mask only specific words

What would be the best strategy to mask only specific words during LM training?
My aim is to mask only words of interest which I have previously collected in a list.

The issue arises because the tokenizer not only splits a single word into multiple tokens, but also adds special characters if the word does not occur at the beginning of a sentence.
E.g.:

The word “Valkyria”:

  • at the beginning of a sentence it gets split as [‘V’, ‘alky’, ‘ria’] with corresponding IDs: [846, 44068, 6374];
  • while in the middle of a sentence it becomes [‘ĠV’, ‘alky’, ‘ria’] with corresponding IDs: [468, 44068, 6374].

This is just one of the issues forcing me to have multiple entries in my list of to-be-filtered IDs.
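
For reference, a minimal sketch that reproduces this behaviour (roberta-base is just an example of a byte-level BPE tokenizer; any similar one shows the same effect):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# At the beginning of a sentence the word splits without a leading-space marker
print(tokenizer.tokenize("Valkyria is a game."))
# In the middle of a sentence the first piece carries the leading-space marker (Ġ)
print(tokenizer.tokenize("I think Valkyria is a game."))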

I have already had a look at the mask_tokens() function in the DataCollatorForLanguageModeling class, which is the function that actually masks tokens for each batch, but I cannot find an efficient and clean way to mask only specific words and their corresponding IDs.

When using the “fast” variant of the tokenizers available in huggingface/transformers, whenever you encode some text, you get back a BatchEncoding.
This BatchEncoding provides some helpful mappings that we can use in this kind of situation. So, you should be able to:

  1. Find the word associated with any token using token_to_word. This method returns the index of the word in the input sequence.
  2. Once you know the word’s index, you can actually retrieve its span with word_to_chars. This will let you extract the word from the input sequence.
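
A minimal sketch of these two lookups in practice (the tokenizer and sentence below are only examples, matching the word discussed above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # any "fast" tokenizer works
text = "And here Valkyria in the middle of one."
encoding = tokenizer(text)

# Map each token back to its word, then to the character span of that word
for token_index in range(len(encoding["input_ids"])):
    word_index = encoding.token_to_word(token_index)
    if word_index is None:  # special tokens (<s>, </s>, ...) belong to no word
        continue
    start, end = encoding.word_to_chars(word_index)
    print(token_index, repr(text[start:end]))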

Hi @Anthony,

Thank you for your prompt reply!
The approach you proposed, if I’m not mistaken, would be helpful in permanently masking words when reading the dataset.

Instead, I am interested in dynamically masking words at batch time.
To be clearer, I would like to efficiently implement a mask_tokens() function (like the one defined in the DataCollatorForLanguageModeling class) which masks only the IDs corresponding to words provided in a specific list.
These IDs would not be masked in every batch, but according to some stochastic strategy, unlike an approach which masks them once and for all when the dataset is read at the beginning.

I am wondering whether there is an efficient and clean way to do so, exploiting the Hugging Face functions and also taking into account that some words get split into multiple IDs depending on the tokenizer at hand.
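
For concreteness, a rough sketch of the kind of mask_tokens() variant I have in mind (the helper name and arguments are made up for illustration; this is not an existing Hugging Face API):

import torch

def mask_listed_ids(inputs, tokenizer, target_ids, mask_probability=0.15):
    # inputs: LongTensor of token IDs, shape (batch_size, seq_len)
    inputs = inputs.clone()
    labels = inputs.clone()
    # Positions eligible for masking: only the IDs collected in the list
    eligible = torch.zeros_like(inputs, dtype=torch.bool)
    for token_id in target_ids:
        eligible |= inputs == token_id
    # Stochastic selection among the eligible positions, bernoulli-style
    probability_matrix = torch.full(inputs.shape, mask_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool() & eligible
    labels[~masked_indices] = -100  # compute the loss only on masked positions
    inputs[masked_indices] = tokenizer.mask_token_id
    return inputs, labels

A helper like this still works on individual IDs: when a word splits into several subword IDs, each piece would be selected independently, which is exactly the multi-ID problem described above.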

@Gabrer did you find a solution?

You can create your own custom mask and merge that with the automatic mask.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
word = "Valkyria"
sentences = ["Valkyria at the beginning of sentence", "And here Valkyria in the middle of one."]
# Index of the word in each sentence (word-tokenized, i.e. split on whitespace)
word_idxs_in_sent = [sent.split(" ").index(word) for sent in sentences]

encoded = tokenizer(sentences, return_tensors="pt", padding=True)
print("Original mask", encoded["attention_mask"])

# For each sentence, set a subword token to False if it belongs to the word (becomes 0 in the LongTensor)
match_idxs = torch.LongTensor([[wid != word_idxs_in_sent[batch_idx] for wid in encoded.word_ids(batch_idx)]
                               for batch_idx in range(len(sentences))])
print("Subword indices of matching word", match_idxs)

# Merge: where our custom match is zero, keep that zero; everywhere else, use the original mask
# This ensures that we mask the word's IDs but keep the original mask for special tokens (cls, pad, etc.)
merged = torch.where(match_idxs == 0, match_idxs, encoded["attention_mask"])
print("Merged mask", merged)

Results:

Original mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Subword indices of matching word tensor([[1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
Merged mask tensor([[1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
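
As a possible follow-up (not shown in the snippet above), the same match_idxs tensor could be used to build MLM-style inputs and labels, replacing the matched subword tokens with the mask token and ignoring everything else in the loss:

# Illustrative continuation of the snippet above
input_ids = encoded["input_ids"]
masked_inputs = torch.where(match_idxs == 0,
                            torch.full_like(input_ids, tokenizer.mask_token_id),
                            input_ids)
labels = torch.where(match_idxs == 0, input_ids, torch.full_like(input_ids, -100))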