Best way to mask a multi-token word when using `.*ForMaskedLM` models

For example, in a context where the model is likely to predict the word seaplane (which gets decomposed into two tokens), should I include a single mask or two masks in the contextual sentence?

Here is a complete example: Google Colaboratory

Below are the top 6 predicted words for a single mask (where the word seaplane should go). Here it seems reasonable to concatenate the top two predicted vocab words, but that pattern doesn't extend to the less probable words further down the list.

```python
top_vocab_idxes = torch.topk(torch.softmax(single_mask_token_logits[masked_idx], dim=0), 6)
for token_id in top_vocab_idxes[1]:
    print(tokenizer.decode([token_id]))
```
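For reference, `single_mask_token_logits` and `masked_idx` come from the Colab notebook; the softmax/topk mechanics themselves can be reproduced with a stand-in logits tensor (vocab size 10 here, versus roughly 30k for a real BERT vocab):

```python
import torch

# Dummy vocabulary-sized logits standing in for the model output at the
# masked position.
mask_logits = torch.tensor([0.1, 2.0, -1.0, 3.5, 0.0, 1.2, -0.5, 2.7, 0.3, -2.0])

# softmax turns logits into a probability distribution over the vocab;
# topk returns (probabilities, vocab indices) sorted most-likely first.
probs = torch.softmax(mask_logits, dim=0)
top = torch.topk(probs, 6)

for p, token_id in zip(top.values, top.indices):
    # with a real tokenizer you would call tokenizer.decode([token_id]) here
    print(int(token_id), float(p))
```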

Below is the result of using two masks in the contextual sentence, printing the top 6 most likely combos for the first and second masked tokens, one combo per line.

```python
top_vocab_idxes = torch.topk(probs, 6)
for token_id in torch.transpose(top_vocab_idxes[1], 1, 0):
    print(tokenizer.decode(token_id))
```

```
sea plane
water area
mountain hangar
land dive
landing aircraft
flying field
```
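One caveat with the transpose loop above: it pairs the i-th best candidate for the first mask with the i-th best for the second, which is not the same as the 6 most likely *joint* combos. Here is a sketch with dummy logits (shapes and variable names are my own) showing both the per-position pairing and a joint ranking under an independence assumption:

```python
import torch

# Dummy logits for the two masked positions (vocab size 8 here).
torch.manual_seed(0)
two_mask_logits = torch.randn(2, 8)

# probs has shape [num_masks, vocab]; topk is taken independently per position.
probs = torch.softmax(two_mask_logits, dim=-1)
top = torch.topk(probs, 6)

# Transposing pairs the i-th best candidate for position 0 with the i-th
# best candidate for position 1 -- aligned per-position rankings, not the
# 6 most likely joint combos.
for pair in torch.transpose(top.indices, 1, 0):
    print(pair.tolist())

# A true joint ranking (assuming the two positions are independent) scores
# every pair by the product of the two marginal probabilities:
joint = probs[0].unsqueeze(1) * probs[1].unsqueeze(0)   # [vocab, vocab]
best = torch.topk(joint.flatten(), 3).indices
best_pairs = [(int(i) // 8, int(i) % 8) for i in best]
print(best_pairs)
```

In practice the two subword distributions are not independent, which is one reason the combos degrade quickly past the top few.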

In this particular case the top 3 most probable combos above seem like reasonable predictions for the two masked tokens, given the context:

```python
double_mask_sentence = f"""When taking off in a seaplane, flying in a seaplane,
and then landing in a {tokenizer.mask_token} {tokenizer.mask_token},
remember to fasten your seat belt."""
```

It seems likely that I should use the second method above for my inference and possible later fine-tuning; however, I doubt this is what is done during pretraining.
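For what it's worth, original BERT pretraining selects 15% of wordpiece tokens independently, so the two pieces of "seaplane" are usually not masked together; the whole-word-masking variant does mask all pieces of a word at once, which matches the two-mask setup. A toy sketch of that variant (the token list and function names are my own, and the real procedure also applies the 80/10/10 mask/random/keep rule):

```python
import random

# Toy wordpiece sequence; "sea" + "##plane" are subwords of one word.
tokens = ["when", "landing", "in", "a", "sea", "##plane", ",",
          "fasten", "your", "seat", "belt"]

def whole_word_spans(tokens):
    """Group subword indices so continuation pieces ('##...') stay with their word."""
    spans, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                spans.append(current)
            current = [i]
    if current:
        spans.append(current)
    return spans

def mask_like_pretraining(tokens, mask_prob=0.15, rng=None):
    """Whole-word-masking sketch: pick ~15% of words; every subword of a
    chosen word gets its own [MASK] token."""
    rng = rng or random.Random(0)
    out = list(tokens)
    for span in whole_word_spans(tokens):
        if rng.random() < mask_prob:
            for i in span:
                out[i] = "[MASK]"  # real pretraining also uses the 80/10/10 rule
    return out
```

So a model pretrained with whole-word masking has in fact seen multi-token words as consecutive masks, one `[MASK]` per subword.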

Thank you for any feedback on what might be best practice here.

This is something of interest to me too!

This might be of help: [2009.07118] It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners