For example, in a context where the model is likely to predict the word seaplane
(which gets decomposed into two tokens), should I include a single mask or two masks in the contextual sentence?
Here is a complete example: Google Colaboratory
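As a quick check on that decomposition, the subword split can be inspected directly from the tokenizer. A minimal sketch, assuming the bert-base-uncased checkpoint (the notebook may use a different model, and the exact split depends on the tokenizer's vocabulary):

# Inspect how "seaplane" is split into subword tokens.
# Assumes bert-base-uncased; the exact split depends on the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("seaplane"))   # e.g. ['sea', '##plane'] for a WordPiece vocab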
Below are the top 6 predicted words for a single mask (where the word seaplane
should go). Here it seems reasonable to concatenate the top two predicted vocab words, but this approach doesn't seem to extend to the less probable words in the list below.
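The snippet below references single_mask_token_logits and masked_idx without showing how they are built. Here is a minimal sketch of one way to produce them, assuming bert-base-uncased and a hypothetical single_mask_sentence (a one-mask variant of the double-mask sentence shown further down); the actual notebook may differ.

# Sketch: produce single_mask_token_logits and masked_idx used in the snippet below.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical single-mask version of the double-mask sentence used later in this post
single_mask_sentence = (
    "When taking off in a seaplane, flying in a seaplane, and then landing "
    f"in a {tokenizer.mask_token}, remember to fasten your seat belt."
)

inputs = tokenizer(single_mask_sentence, return_tensors="pt")
with torch.no_grad():
    # Logits for every position in the sequence: shape (seq_len, vocab_size)
    single_mask_token_logits = model(**inputs).logits[0]

# Index of the single [MASK] token in the input ids
masked_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()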
# Top 6 predictions at the single mask position
top_vocab_idxes = torch.topk(torch.softmax(single_mask_token_logits[masked_idx], dim=0), 6)
for token_id in top_vocab_idxes[1]:
    print(tokenizer.decode([token_id]))
sea
plane
hangar
helicopter
lake
river
Below is the result of using two masks in the contextual sentence, printing the top 6 most likely first/second-token combos, one per line.
# Top 6 predictions at each of the two mask positions, paired up by rank
top_vocab_idxes = torch.topk(probs, 6)
for token_id in torch.transpose(top_vocab_idxes[1], 1, 0):
    print(tokenizer.decode(token_id))
sea plane
water area
mountain hangar
land dive
landing aircraft
flying field
In this particular case, the top 3 most probable combos above seem like reasonable predictions for the two masked tokens, given the context:
double_mask_sentence = f"""When taking off in a seaplane, flying in a seaplane,
and then landing in a {tokenizer.mask_token} {tokenizer.mask_token},
remember to fasten your seat belt."""
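For reference, here is a minimal sketch of one way the probs tensor used in the two-mask snippet above might be computed, reusing the tokenizer and model from the single-mask sketch earlier and the double_mask_sentence just defined (this is an assumption about the notebook, not taken from it).

# Sketch: compute per-position probabilities for the two [MASK] tokens.
# Reuses tokenizer/model from the single-mask sketch and double_mask_sentence above.
inputs = tokenizer(double_mask_sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]              # shape: (seq_len, vocab_size)

# Positions of the two [MASK] tokens in the input ids
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

# Softmax over the vocabulary at each mask position: shape (2, vocab_size)
probs = torch.softmax(logits[mask_positions], dim=-1)

With probs computed this way, each printed pair is the i-th most likely token at each position taken independently, not a jointly scored combination.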
It seems likely that I should use the second method above for my inference and possible later fine-tuning; however, I doubt this is what is done during pretraining.
Thank you for any feedback on what might be best practice here.