Multiple Mask Tokens

For those wishing to [MASK] several tokens, here is how I did it.

My question, however, relates to the output: I added top_k expecting to get back multiple candidate sentences, but that was not the case, and I am not sure how to achieve this.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Three [MASK] tokens to fill in; [CLS]/[SEP] are added by hand here.
input_tx = "[CLS] [MASK] [MASK] [MASK] of the United States mismangement of the Coronavirus is its distrust of science. [SEP]"
tokenized_text = tokenizer.tokenize(input_tx)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
top_k = 10
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([[0] * 25])  # 25 tokens in this example, all in segment 0

model = BertForMaskedLM.from_pretrained('bert-base-cased')

outputs = model(tokens_tensor, token_type_ids=segments_tensors)
predictions = outputs[0]
# argmax only gives the single most likely token at each position
predicted_index = [torch.argmax(predictions[0, i]).item() for i in range(0, 24)]
predicted_token = [tokenizer.convert_ids_to_tokens([predicted_index[x]])[0] for x in range(1, 24)]
print(predicted_token)


Output: ['The', 'main', 'cause', 'of', 'the', 'United', 'States', 'mi', '##sman', '##gement', 'of', 'the', 'Co', '##rona', '##virus', 'is', 'its', 'di', '##st', '##rust', 'of', 'science', ...]

Hi there! First of all, please note that in the latest release, the recommended way to preprocess your input is just to call the tokenizer on your text:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
input_txt = "[MASK] [MASK] [MASK] of the United States mismanagement of the Coronavirus is its distrust of science."
inputs = tokenizer(input_txt, return_tensors='pt')
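Just to illustrate what the tokenizer hands back (this inspection step is mine, not part of the original reply), the dict for a BERT tokenizer normally holds input_ids, token_type_ids and attention_mask, with [CLS] and [SEP] added automatically:

print(inputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs['input_ids'])  # token ids, including the automatically added [CLS] and [SEP]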

This returns a dict mapping strings to tensors (since we asked for PyTorch tensors with return_tensors='pt'), and you can then call your model directly on it:

model = BertForMaskedLM.from_pretrained('bert-base-cased')

outputs = model(**inputs)
predictions = outputs[0]  # logits of shape [batch_size, seq_len, vocab_size]

At this stage, predictions contains the outputs of the language model before the softmax (which we don't need here, since the probabilities after the softmax and the activations before it are in the same order). Your code takes the argmax, so it only returns the single most probable token at each position. If you want, say, the 10 most probable tokens, you could do:

# sort the vocabulary scores at every position, most probable first
sorted_preds, sorted_idx = predictions[0].sort(dim=-1, descending=True)
for k in range(10):
    # k-th most probable token at each position (indices hard-coded for this example sentence)
    predicted_index = [sorted_idx[i, k].item() for i in range(0, 24)]
    predicted_token = [tokenizer.convert_ids_to_tokens([predicted_index[x]])[0] for x in range(1, 24)]
    print(predicted_token)
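A small variation on the snippet above (not from the original reply, just a sketch using standard tokenizer and tensor calls): instead of hard-coding the index ranges, you can locate the [MASK] positions with tokenizer.mask_token_id and print the top candidates only there:

# positions in the sequence where the token id equals the [MASK] id
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions.tolist():
    # 10 highest-scoring vocabulary ids for this particular [MASK]
    top_ids = predictions[0, pos].topk(10).indices.tolist()
    print(pos, tokenizer.convert_ids_to_tokens(top_ids))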

Thank you so much for your help. I really appreciate your in-depth explanation.


Is there a way to retrieve the probabilities of the words predicted for the multiple masks? Any help would be really appreciated.


Since experimental support for multi-masking was recently added to the fill-mask pipeline, retrieving the score for each token is supported, although it is still unclear what the correct semantics should be. If you want to use the targets parameter, this is only supported in the single-mask case. @Lysandre goes into further details in the PR discussion.
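If you would rather not go through the pipeline, one manual sketch (reusing the model, tokenizer, inputs and predictions from the answer above, and assuming you just want per-candidate probabilities) is to softmax the logits and read them off at each [MASK] position:

probs = torch.softmax(predictions[0], dim=-1)  # [seq_len, vocab_size], each row sums to 1
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions.tolist():
    top = probs[pos].topk(5)  # 5 most probable fillers with their probabilities
    for p, idx in zip(top.values.tolist(), top.indices.tolist()):
        print(pos, tokenizer.convert_ids_to_tokens([idx])[0], round(p, 4))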