Multiple Mask Tokens

For those wishing to [MASK] several tokens, here is how I did it.

My question, however, relates to the output: I added top_k expecting to get back multiple candidate sentences, but that was not the case, and I am not sure how to achieve this.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Three [MASK] tokens to fill in; [CLS]/[SEP] are added by hand here.
input_tx = "[CLS] [MASK] [MASK] [MASK] of the United States mismangement of the Coronavirus is its distrust of science. [SEP]"
tokenized_text = tokenizer.tokenize(input_tx)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
top_k = 10
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([[0] * 25])  # 25 tokens in this example, all in segment 0

model = BertForMaskedLM.from_pretrained('bert-base-cased')

outputs = model(tokens_tensor, token_type_ids=segments_tensors)
predictions = outputs[0]
# argmax only gives the single most likely token at each position
predicted_index = [torch.argmax(predictions[0, i]).item() for i in range(0, 24)]
predicted_token = [tokenizer.convert_ids_to_tokens([predicted_index[x]])[0] for x in range(1, 24)]
print(predicted_token)


Output: ['The', 'main', 'cause', 'of', 'the', 'United', 'States', 'mi', '##sman', '##gement', 'of', 'the', 'Co', '##rona', '##virus', 'is', 'its', 'di', '##st', '##rust', 'of', 'science', ...]

Hi there! First of all, please note that in the latest release, the recommended way to preprocess your input is just to call the tokenizer on your text:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
input_txt = "[MASK] [MASK] [MASK] of the United States mismanagement of the Coronavirus is its distrust of science."
inputs = tokenizer(input_txt, return_tensors='pt')
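Just to illustrate what the tokenizer hands back (this inspection step is mine, not part of the original reply), the dict for a BERT tokenizer normally holds input_ids, token_type_ids and attention_mask, with [CLS] and [SEP] added automatically:

print(inputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs['input_ids'])  # token ids, including the automatically added [CLS] and [SEP]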

This returns a dict mapping strings to tensors (since we asked for PyTorch tensors with return_tensors='pt'), and you can then call your model directly on it:

model = BertForMaskedLM.from_pretrained('bert-base-cased')

outputs = model(**inputs)
predictions = outputs[0]  # logits of shape [batch_size, seq_len, vocab_size]

At this stage, predictions contains the outputs of the language model before the softmax (which we don't need here, since the probabilities after the softmax and the activations before it are in the same order). Your code takes the argmax, so it only returns the single most probable token at each position. If you want, say, the 10 most probable tokens, you could do:

# sort the vocabulary scores at every position, most probable first
sorted_preds, sorted_idx = predictions[0].sort(dim=-1, descending=True)
for k in range(10):
    # k-th most probable token at each position (indices hard-coded for this example sentence)
    predicted_index = [sorted_idx[i, k].item() for i in range(0, 24)]
    predicted_token = [tokenizer.convert_ids_to_tokens([predicted_index[x]])[0] for x in range(1, 24)]
    print(predicted_token)
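A small variation on the snippet above (not from the original reply, just a sketch using standard tokenizer and tensor calls): instead of hard-coding the index ranges, you can locate the [MASK] positions with tokenizer.mask_token_id and print the top candidates only there:

# positions in the sequence where the token id equals the [MASK] id
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions.tolist():
    # 10 highest-scoring vocabulary ids for this particular [MASK]
    top_ids = predictions[0, pos].topk(10).indices.tolist()
    print(pos, tokenizer.convert_ids_to_tokens(top_ids))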

Thank you so much for your help. I really appreciate your in-depth explanation.


Is there a way to retrieve the probabilities of the words predicted for the multiple masks? Any help would be really appreciated.


Since experimental support for multi-masking was recently added to the fill-mask pipeline, retrieving the score for each token is supported, although it is still unclear what the correct semantics should be. If you want to use the targets parameter, this is only supported in the single-mask case. @Lysandre goes into further details in the PR discussion.
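If you would rather not go through the pipeline, one manual sketch (reusing the model, tokenizer, inputs and predictions from the answer above, and assuming you just want per-candidate probabilities) is to softmax the logits and read them off at each [MASK] position:

probs = torch.softmax(predictions[0], dim=-1)  # [seq_len, vocab_size], each row sums to 1
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions.tolist():
    top = probs[pos].topk(5)  # 5 most probable fillers with their probabilities
    for p, idx in zip(top.values.tolist(), top.indices.tolist()):
        print(pos, tokenizer.convert_ids_to_tokens([idx])[0], round(p, 4))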