Assume I have this code snippet:
from transformers import BertForMaskedLM, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
inputs = tokenizer(
    ["The capital of", "yes yes"],
    return_tensors="pt",
    padding=True,
)
pred = model(**inputs)
We get an attention_mask, and it looks like this:
>>> inputs['attention_mask']
tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])
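For reference, inspecting the tokenized inputs shows where these 5 positions come from (a quick sketch; the token strings in the comments are what I would expect from bert-base-uncased, including the special tokens):

# Sketch: look at the tokens behind the mask.
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
# expected: ['[CLS]', 'the', 'capital', 'of', '[SEP]']
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][1]))
# expected: ['[CLS]', 'yes', 'yes', '[SEP]', '[PAD]']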
Then we step into BertForMaskedLM's forward and follow it down to BertSelfAttention (modeling_bert.py:350), where the mask is applied:
if attention_mask is not None:
    # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
    attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
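As I understand it, the tokenizer's 2D attention_mask is converted into this 4D additive mask earlier, in BertModel's forward via get_extended_attention_mask. Roughly (a sketch of the idea, not the exact library code):

import torch

# [batch, seq_len] -> [batch, 1, 1, seq_len], then 1 -> 0.0 and 0 -> a very
# large negative number (torch.finfo(dtype).min, the -3.4028e+38 seen below).
extended_mask = inputs['attention_mask'][:, None, None, :].to(torch.float32)
extended_mask = (1.0 - extended_mask) * torch.finfo(torch.float32).min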
Here attention_mask.shape is torch.Size([2, 1, 1, 5]), and its value is:

tensor([[[[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00]]],

        [[[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -3.4028e+38]]]])
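Because the mask has a query dimension of size 1 (shape [2, 1, 1, 5]), adding it to attention_scores broadcasts the same row of mask values over every attention head and every query position. A standalone sketch with dummy scores reproduces the pattern I see below in attention_probs:

import torch

# Dummy scores with the same shapes as in the model: [batch, heads, query, key].
scores = torch.zeros(2, 12, 5, 5)
mask = torch.zeros(2, 1, 1, 5)
mask[1, ..., 4] = torch.finfo(torch.float32).min  # padded key position of example 2
probs = torch.softmax(scores + mask, dim=-1)
print(probs[1, 0])  # only the last *column* is zero; every row still sums to 1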
The attention_probs.shape is torch.Size([2, 12, 5, 5]). Looking at the second example in the batch:
# print(attention_probs[1][0])
tensor([[0.1480, 0.1129, 0.1008, 0.6383, 0.0000],
        [0.2509, 0.1860, 0.1909, 0.3722, 0.0000],
        [0.2001, 0.2207, 0.2237, 0.3555, 0.0000],
        [0.3047, 0.1777, 0.1707, 0.3469, 0.0000],
        [0.0713, 0.3388, 0.3918, 0.1981, 0.0000]])
The last column is 0, which is what we expect. But shouldn't the last row be 0 as well?
The real token length is only 4, so it doesn't seem right to compute attention scores between an out-of-range (padding) token in the query and every token in the key. I hope someone can help me explain this.