Assume I am using one of the BERT models for a masked language modeling task:
from transformers import BertForMaskedLM, BertTokenizerFast
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
Now let's say that I have the following sentence:
text = "Jeremy Bentham was the founder of modern utilitarianism"
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True, truncation=True, max_length=64)
Now suppose I mask a random token somewhere in the sentence; for this example, let's arbitrarily pick the 7th token:
inputs['input_ids'][0, 7] = tokenizer.mask_token_id
and then pass this to the model as
outputs = model(**inputs)
scores = outputs.logits
If I want the embedding (vector) of only the masked token, how should I access it?
Your scores have shape [batch_size, seq_len, vocab_size], so indexing scores at the positions where you masked should give you the predictions for the masked token.
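To make the indexing concrete, here is a minimal sketch. It uses a random dummy tensor in place of the real logits (a real run would use the scores returned by BertForMaskedLM), but the shapes match what the model produces:

```python
import torch

# Toy stand-in for the model output: the real `scores` from
# BertForMaskedLM has shape [batch_size, seq_len, vocab_size].
batch_size, seq_len, vocab_size = 1, 12, 30522
scores = torch.randn(batch_size, seq_len, vocab_size)

masked_index = 7  # the position that was masked earlier

# Logits for the masked position only: a vector of shape [vocab_size]
masked_token_scores = scores[0, masked_index]

# The most likely token id at that position
predicted_id = masked_token_scores.argmax(dim=-1).item()
```

From there you could decode `predicted_id` back to a token with `tokenizer.decode([predicted_id])`.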
@sgugger In this example, I already know the index because it was chosen arbitrarily before passing the input to the model. I am asking about cases where I don’t know the exact index of the masked token and I only receive the input with a masked random token.
What I had in mind is to search for the masked token in the input before it gets passed to the model and save that index, so I can access the masked token's prediction later. But this will take linear runtime for each input example.
If you don't save the places you randomly masked, you have no other choice. Getting the location of the masked token will be quick in any case, compared to going through the model, as long as you use PyTorch functions for it.
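For example, the mask positions can be recovered with a vectorized comparison instead of a Python loop. This is a sketch with a hand-built dummy `input_ids` tensor; in real code you would compare against `tokenizer.mask_token_id` (which happens to be 103 for bert-base-uncased):

```python
import torch

mask_token_id = 103  # tokenizer.mask_token_id for bert-base-uncased

# Dummy batch of token ids with [MASK] (103) at positions 1 and 3
input_ids = torch.tensor([[101, 103, 2003, 103, 102]])

# Boolean comparison plus nonzero() finds every masked slot at once
batch_indices, token_indices = (input_ids == mask_token_id).nonzero(as_tuple=True)
```

Once you have the logits, `scores[batch_indices, token_indices]` then pulls out the predictions for every masked position in the batch in one indexing operation.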