Best way to mask a multi-token word when using `.*ForMaskedLM` models

For example, in a context where the model is likely to predict the word seaplane (which gets decomposed into two tokens), should I include a single mask or two masks in the contextual sentence?

Here is a complete example: Google Colaboratory

Below are the top 6 predicted words for a single mask (where the word seaplane should go). Here it seems reasonable to concatenate the top two predicted vocabulary words, but this approach doesn’t seem to extend to the less probable words further down the list.

top_vocab_idxes = torch.topk(torch.softmax(single_mask_token_logits[masked_idx], dim=0), 6)
for token_id in top_vocab_idxes[1]:
    print(tokenizer.decode([token_id]))
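As a reminder of the mechanics the snippet relies on (toy logits here, not the model’s real output): `torch.topk` returns a `(values, indices)` pair, so indexing with `[1]` picks the token ids.

```python
import torch

# Toy logits over a 5-token vocabulary (made up, not model output).
logits = torch.tensor([2.0, 0.5, 1.0, -1.0, 0.0])
probs = torch.softmax(logits, dim=0)

# torch.topk returns a namedtuple (values, indices);
# top[1] (as in the snippet above) is the same as top.indices.
top = torch.topk(probs, 3)
print(top.indices.tolist())  # ids of the 3 most probable entries: [0, 2, 1]
```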

Below is the result of using two masks in the contextual sentence, printing the top 6 most likely combinations of first and second masked tokens, one combination per line.

top_vocab_idxes = torch.topk(probs, 6)
for token_id in torch.transpose(top_vocab_idxes[1], 1, 0):
    print(tokenizer.decode(token_id))
sea plane
water area
mountain hangar
land dive
landing aircraft
flying field
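Worth noting: the column-wise decoding above pairs the k-th most likely first token with the k-th most likely second token; it does not rank joint combinations. A toy sketch of the same transpose trick (made-up probabilities, not model output), plus a product score under the strong assumption that the two positions are independent:

```python
import torch

# Toy per-position probabilities for the two mask positions
# (shape: 2 positions x 4-token vocab; numbers are made up).
probs = torch.tensor([[0.60, 0.20, 0.10, 0.10],
                      [0.15, 0.70, 0.10, 0.05]])

top = torch.topk(probs, 2)                   # top-2 ids per position
pairs = torch.transpose(top.indices, 1, 0)   # one (first, second) pair per row
for first, second in pairs.tolist():
    # Under the independence assumption, a pair's score is simply the
    # product of the two per-position probabilities.
    score = probs[0, first].item() * probs[1, second].item()
    print(first, second, score)
```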

In this particular case the top 3 most probable combos above seem like reasonable predictions for the two masked tokens given the context:

double_mask_sentence = f"""When taking off in a seaplane, flying in a seaplane,
and then landing in a {tokenizer.mask_token} {tokenizer.mask_token},
remember to fasten your seat belt."""

It seems likely that I should use the second method above for my inference and possibly for later fine-tuning. However, I doubt this is what is done during pretraining.

Thank you for any feedback on what might be best practice here.

This is something of interest for me too!

This might be of help: [2009.07118] It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Following up here; sorry I didn’t do so sooner.

My concern at the time of writing this post was that masking a single sub-word token (one that is part of a multi-token word) may influence the predictions for the tokens immediately before or after it, by ‘leaking’ the information that the predicted tokens should also be sub-word tokens (to later be combined with the unmasked sub-word token).

It seems one way to fix this is by using ‘whole-word-masking’ as shown in this notebook, whole_word_to_mask_multitoken_words_in_*ForMaskedLM.ipynb.
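A minimal sketch of the whole-word-masking idea over WordPiece-style tokens, where continuation pieces carry the `##` prefix (the function name and tokens here are illustrative, not taken from the notebook):

```python
def whole_word_mask(tokens, word_index, mask_token="[MASK]"):
    """Mask the word at `word_index`, replacing every sub-word piece
    that belongs to it with one mask token each."""
    # Group token positions into words: a new word starts at any
    # token that does not begin with the "##" continuation prefix.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for i in words[word_index]:
        masked[i] = mask_token
    return masked

tokens = ["landing", "in", "a", "sea", "##plane", "today"]
print(whole_word_mask(tokens, 3))
# word 3 is "sea ##plane", so both of its pieces get masked:
# ['landing', 'in', 'a', '[MASK]', '[MASK]', 'today']
```

This way the model never sees a stray unmasked sub-word piece that could hint at what the masked piece should be.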

However, I wasn’t sure whether this ‘leaking’ of information was really happening, so I tried a slightly different experiment: explicitly replacing a single-word token id (for the word plane) with a sub-word token id (also for the word plane, but its sub-word variant), and then masking the token preceding it.

I also changed the sentence so that both subword and single word tokens may be the appropriate top predictions for the masked word.

# Step 1: mask the `sea`-like word before the single-word token `plane`
single_mask_mutitoken_sentence = f"""When taking off in a small seaplane,
flying in any small plane, and then landing in a {tokenizer.mask_token} plane,
remember to fasten your seat belt."""

# Step 2: swap the `plane` token id for the `_plane` (sub-word) token id
single_mask_mutitoken_swapped_input = torch.where(
    single_mask_mutitoken_input == plane_id,
    _plane_id, single_mask_mutitoken_input)
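As a sanity check of the swap, here is the same `torch.where` pattern on toy token ids (the ids 7 and 9 are made up stand-ins for the `plane` and `_plane` vocabulary entries):

```python
import torch

# Pretend id 7 = `plane` (word-initial) and id 9 = `_plane` (sub-word variant).
plane_id, subword_plane_id = 7, 9
input_ids = torch.tensor([[4, 7, 5, 7, 6]])

# torch.where(condition, value_if_true, value_if_false), elementwise.
swapped = torch.where(input_ids == plane_id,
                      torch.tensor(subword_plane_id), input_ids)
print(swapped.tolist())  # every `plane` id replaced: [[4, 9, 5, 9, 6]]
```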

And we do see that the top tokens predicted to precede the sub-word token are all themselves sub-word-appropriate tokens. This is the case even though the single-token word small would have been a pretty suitable prediction.

for token_id in top_vocab_idxes[1]:
    print(tokenizer.decode([token_id]))


This can be seen in this notebook: possible_leaking_of_subwords_in_multitoken_words_in_*ForMaskedLM.ipynb, indicating to me that some leaking is in fact going on.

Would love others’ thoughts on this!