Here it says you can mask k tokens. However, the documentation only shows a single token being masked. Is it possible to mask k words, or am I mistaken?
Are you using the fill-mask pipeline? If so, there's a hard-coded limit of a single mask in that class, even though the model itself may support multiple masks. I guess the added behavior would also warrant some extra functionality for choosing how to sample: if there are N masked tokens, each with its own top-k probabilities, one might want to rank candidates either from the joint distribution (i.e. ranking pairs by p1*p2) or independently per mask (a rough sketch of both is below). The best approach would depend on the model's internals, I suppose.
I saw a post a while back welcoming a PR for this matter, so it’s a wanted feature.
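For concreteness, here is a rough sketch of the two ranking strategies for a two-mask case. This is only an illustration, not the pipeline's actual behavior; the roberta-base checkpoint and the variable names are arbitrary choices on my part:

import torch
from itertools import product
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder checkpoint; any masked LM that uses <mask> should behave similarly.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

sentence = "Paris is <mask> <mask> to visit."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits.squeeze(0)

# Positions of the two masks and their top-5 candidates.
mask_positions = (inputs["input_ids"].squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[mask_positions].softmax(dim=-1)   # (num_masks, vocab_size)
topk = probs.topk(k=5, dim=-1)

# Option 1: fill each mask independently with its own best candidate.
independent = [tokenizer.decode(ids[0].item()).strip() for ids in topk.indices]

# Option 2: rank candidate pairs by the product of their probabilities (p1*p2),
# i.e. an approximation of the joint distribution, computed while both masks
# were still in place.
pairs = []
for i, j in product(range(5), range(5)):
    score = (topk.values[0, i] * topk.values[1, j]).item()
    words = (tokenizer.decode(topk.indices[0, i].item()).strip(),
             tokenizer.decode(topk.indices[1, j].item()).strip())
    pairs.append((score, words))
pairs.sort(reverse=True)

print("independent:", independent)
print("best joint pair:", pairs[0])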
I was using the following. However, this code does not work well if you're trying to fill consecutive masked tokens.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# The thread doesn't show how these were loaded; roberta-base is just an
# example of a masked LM whose mask token is <mask>.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

sentence = "The capital of France <mask> contains the Eiffel <mask>."
token_ids = tokenizer.encode(sentence, return_tensors='pt')
# print(token_ids)
token_ids_tk = tokenizer.tokenize(sentence)
print(token_ids_tk)

# Positions of the <mask> tokens in the encoded sequence.
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
masked_pos = [mask.item() for mask in masked_position]
print(masked_pos)

with torch.no_grad():
    output = model(token_ids)

# Logits over the vocabulary for every position in the sentence.
last_hidden_state = output[0].squeeze()

print("\n\n")
print("sentence : ", sentence)
print("\n")

# Top-100 candidate words for each mask, most probable first.
list_of_list = []
for mask_index in masked_pos:
    mask_hidden_state = last_hidden_state[mask_index]
    idx = torch.topk(mask_hidden_state, k=100, dim=0)[1]
    words = [tokenizer.decode(i.item()).strip() for i in idx]
    list_of_list.append(words)
    print(words)

# Concatenate the single best guess for each mask.
best_guess = ""
for j in list_of_list:
    best_guess = best_guess + " " + j[0]
---
What I may try to do is mask two consecutive tokens.
(ex: Paris is <mask> <mask> to visit.)
I'll then re-insert the most probable token for the first <mask>.
(ex: Paris is a <mask> to visit.)
Then, I'll return the second <mask>'s most probable token.
(ex: Paris is a city to visit.)
This is the updated code.
import torch

# tokenizer and model are assumed to be loaded as in the previous snippet.
sentence = "The capital of France <mask> <mask> the Eiffel Tower."
token_ids = tokenizer.encode(sentence, return_tensors='pt')
token_ids_tk = tokenizer.tokenize(sentence)
print(token_ids_tk)

masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
masked_pos = [mask.item() for mask in masked_position]
print(masked_pos)

with torch.no_grad():
    output = model(token_ids)

last_hidden_state = output[0].squeeze()

print("\n\n")
print("sentence : ", sentence)
print("\n")

# Top-5 candidate words for each of the two masks.
list_of_list = []
for mask_index in masked_pos:
    mask_hidden_state = last_hidden_state[mask_index]
    idx = torch.topk(mask_hidden_state, k=5, dim=0)[1]
    words = [tokenizer.decode(i.item()).strip() for i in idx]
    list_of_list.append(words)
    # print(words)

# Build five candidate sentences by filling only the first <mask>
# with each of its top-5 predictions.
sentences3 = []
for i in range(5):
    first_guess = list_of_list[0][i]
    sentence2 = sentence.replace("<mask>", first_guess, 1)
    sentences3.append(sentence2)

# Re-run the model on each candidate sentence to predict the remaining <mask>.
for cand in sentences3:
    # print(cand)
    token_ids = tokenizer.encode(cand, return_tensors='pt')
    token_ids_tk = tokenizer.tokenize(cand)
    # print(token_ids_tk)
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]
    print(masked_pos)
    with torch.no_grad():
        output = model(token_ids)
    last_hidden_state = output[0].squeeze()
    print("\n\n")
    print("sentence : ", cand)
    print("\n")
    list_of_list = []
    for mask_index in masked_pos:
        mask_hidden_state = last_hidden_state[mask_index]
        idx = torch.topk(mask_hidden_state, k=5, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        list_of_list.append(words)
        print(words)
Output:
sentence : The capital of France <mask> <mask> the Eiffel Tower.
[6]
sentence : The capital of France , <mask> the Eiffel Tower.
['and', 'with', 'near', 'at', 'including']
[6]
sentence : The capital of France is <mask> the Eiffel Tower.
['in', 'now', 'called', 'at', 'under']
[6]
sentence : The capital of France lies <mask> the Eiffel Tower.
['atop', 'under', 'beneath', 'in', 'behind']
[6]
sentence : The capital of France stands <mask> the Eiffel Tower.
['atop', 'at', 'on', 'under', 'in']
[6]
sentence : The capital of France rests <mask> the Eiffel Tower.
['atop', 'on', 'in', 'upon', 'under']
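In case it's useful, the same left-to-right greedy idea can be wrapped in a small helper that keeps filling the leftmost mask until none remain. This is just a sketch: fill_masks_greedy is a made-up name, and it assumes the same tokenizer and model as above.

import torch

def fill_masks_greedy(sentence, tokenizer, model, mask_token="<mask>"):
    # Repeatedly fill the leftmost remaining mask with its most probable token.
    while mask_token in sentence:
        token_ids = tokenizer.encode(sentence, return_tensors="pt")
        with torch.no_grad():
            logits = model(token_ids)[0].squeeze(0)
        # Position of the first remaining mask in the encoded sequence.
        mask_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()[0].item()
        best_word = tokenizer.decode(logits[mask_pos].argmax().item()).strip()
        # Replace only the first occurrence, then repeat for the next mask.
        sentence = sentence.replace(mask_token, best_word, 1)
    return sentence

print(fill_masks_greedy("The capital of France <mask> <mask> the Eiffel Tower.", tokenizer, model))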
Hey, thanks for the code.
Did you manage to solve it for consecutive masked tokens? I'm having the same problem.
Cheers,
Fran
@franfram I had the same issue… Do you happen to know which model supports multiple tokens during inference?
@cnut1648
I did, but it doesn't work that well (I'm using a Spanish model; maybe an English one will work better).
Here's a Colab with the code.
Hope it helps.
Cheers
I see, thanks!!! I tried with an English one and it seems that RoBERTa for English also doesn't work well…