# Bert attention mask question

Assume I have this code snippet:

```
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer(
    ["The capital of", "yes yes"],
    return_tensors="pt", padding=True)

pred = model(**inputs)
```

We get an `attention_mask` that looks like:

```
> inputs['attention_mask']
tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])
```

Then we step into BertForMaskedLM's forward pass, down into BertSelfAttention (modeling_bert.py:350):

```
if attention_mask is not None:
    # Apply the attention mask (precomputed for all layers in BertModel's forward() function)
    attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
```

Here `attention_mask.shape` is `torch.Size([2, 1, 1, 5])`, and its value is:

```
tensor([[[[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00]]],

        [[[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -3.4028e+38]]]])
```
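For reference, this extended mask is built from the 0/1 `attention_mask` inside `BertModel` before the encoder runs. A minimal sketch of that conversion (recent transformers versions use `torch.finfo(dtype).min` as the fill value; older ones used `-10000.0`):

```python
import torch

# The 0/1 mask produced by the tokenizer (second sequence has one padding token)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0]])

# Broadcast to [batch, 1, 1, seq_len] and map 1 -> -0.0, 0 -> a huge negative
# number, mirroring what get_extended_attention_mask does in transformers
extended = attention_mask[:, None, None, :].float()
extended = (1.0 - extended) * torch.finfo(torch.float32).min

print(extended.shape)  # torch.Size([2, 1, 1, 5])
```

Note that `(1.0 - 1) * min` yields `-0.0`, which is exactly the `-0.0000e+00` printed above.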

and `attention_probs.shape` is `torch.Size([2, 12, 5, 5])`. Looking at the second example:

```
# print(attention_probs[1][0])
tensor([[0.1480, 0.1129, 0.1008, 0.6383, 0.0000],
        [0.2509, 0.1860, 0.1909, 0.3722, 0.0000],
        [0.2001, 0.2207, 0.2237, 0.3555, 0.0000],
        [0.3047, 0.1777, 0.1707, 0.3469, 0.0000],
        [0.0713, 0.3388, 0.3918, 0.1981, 0.0000]])
```

The last column is 0, which is what we expect. But shouldn't the last row be 0 too?
The token length is 4, so it seems wrong to compute attention scores between an out-of-range (padding) token in the query and every token in the key. I hope someone can help me explain this.
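The behavior can be reproduced outside BERT. The additive mask has shape `[..., 1, seq_len]`, so it varies only along the key dimension and broadcasts over every query row, including the padding one (a standalone sketch, not BERT's actual code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(5, 5)  # raw q·k scores for one head, 5 tokens

# Additive mask of shape [1, 5]: broadcast over every query row
mask = torch.zeros(1, 5)
mask[0, 4] = torch.finfo(torch.float32).min  # last key token is padding

probs = F.softmax(scores + mask, dim=-1)
# The last column is zero (padding is never attended to), but every row,
# including the padding query's row, is still a valid distribution
```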

I have the same problem. Have you figured it out?

In the statement below (x_5 means that dimension has length 5), from modeling_bert.py:
`context_layer = torch.matmul(attention_probs[batch_2, head_12, x_5, x_5], value_layer[batch_2, head_12, x_5, h_64])`

In this case, torch.matmul behaves like a batched torch.bmm, so the expression above reduces to
`prob[x_5, x_5] @ value[x_5, h_64]`

According to the definition of 2D matrix multiplication, let's look at prob's rows and value's columns.

Since each row vector of value is treated as a token's embedding, for convenience we can just consider `probs[0, x_5]` (one row of the first factor) times `value[x_5, 0]` (one column of the second factor).

Then the problem reduces to
`prob[x_5, x_5] @ value[x_5]`, or just `prob[0, x_5] @ value[x_5]`, in which probs degrades into a one-dimensional vector (q·k is also one-dimensional).

Following the self-attention formula, a row of probs (which is multiplied against value's columns), in other words a row of scores (length 5), holds the weights over all 5 value tokens. Since the last value token is padding, the weight on it (the last column of every row) should be zero. That is, the last element of every row of scores/probs should be zero.

Since the last element of every score row should be zero, the last column of the score matrix should be a zero vector.

Another perspective: each row of scores/probs should sum to 1, but the columns have no such constraint.
`attention_probs = nn.Softmax(dim=-1)(attention_scores)`: softmax(dim=-1) means the sum along the last dimension equals 1.
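A quick numerical check of the dim=-1 behavior (a minimal sketch with random scores):

```python
import torch
import torch.nn as nn

scores = torch.randn(2, 12, 5, 5)  # [batch, heads, query, key]
probs = nn.Softmax(dim=-1)(scores)

# Each row (the last, key dimension) sums to 1...
print(torch.allclose(probs.sum(dim=-1), torch.ones(2, 12, 5)))  # True
# ...while the column sums are unconstrained.
```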

Hope this helps you understand.

There is another interesting observation:

`context_layer = torch.matmul(attention_probs, value_layer)` from modeling_bert.py

As mentioned above, think about `attention_layer_output = prob[x_5, x_5] @ value[x_5]`. prob's column for the padding index is a zero vector, but prob's row for the padding index generally has non-zero elements. Think of probs as a pairwise interaction matrix (but not a symmetric one).

Then, according to the 2D matrix multiplication formula (c_ij = a's row · b's column), c_ij generally won't be zero. As a result, the matrix c generally has no zero elements, which means attention_layer_output has non-zero values even in the padding token's embedding row.
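Plugging the probabilities printed earlier into this product shows both facts at once: the padding token's value vector is never read (zero column), yet the padding row of the output is a non-zero mix of the real tokens (a sketch with a made-up hidden size of 3):

```python
import torch

# attention_probs[1][0] from above: the last column is zero
probs = torch.tensor([[0.1480, 0.1129, 0.1008, 0.6383, 0.0000],
                      [0.2509, 0.1860, 0.1909, 0.3722, 0.0000],
                      [0.2001, 0.2207, 0.2237, 0.3555, 0.0000],
                      [0.3047, 0.1777, 0.1707, 0.3469, 0.0000],
                      [0.0713, 0.3388, 0.3918, 0.1981, 0.0000]])

torch.manual_seed(0)
value = torch.randn(5, 3)  # 5 tokens, hidden size 3 for illustration

context = probs @ value

# Changing the padding token's value vector has no effect on the output...
value2 = value.clone()
value2[4] = 100.0
assert torch.allclose(context, probs @ value2)

# ...but the padding row of the output is still non-zero
assert context[4].abs().sum() > 0
```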

As a consequence, you'll see that the padding token's rows in BERT's hidden tensor change to non-zero vectors (along dimension -2 of the tensor).

That's why people say it's not recommended to correlate input tokens with BERT's output (cls or hidden states) by index, i.e. to extract per-word embeddings from BERT's output (which is usually thought of as a sentence embedding). You can do this, but it may not behave as you expect.

Or you may think of this hybrid BERT output (hidden states) as contextual token embeddings, which incorporate other tokens' information through attention softmax aggregation (as my later answer says, information from padding tokens is not used in this procedure). Then the attempt to map BERT input token indices to BERT output indices (along dimension -2) should make sense.

Another finding, which may solve your problem and explains why the attention mask blocks padding token embeddings during the computation of q, k, v and between two attention layers:

In a word, after the attention mask is applied (zeroing the score contribution of padding tokens when q queries key k's padding columns via q·K), both the original input (512 × 768) and the attention output (512 × 768) keep their padding-related elements in the same padding-related rows. The attention layer does not change the padding token's one-to-one row mapping (roughly speaking).
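This row-preservation can be checked directly: in a masked single-head attention, perturbing the padding token's value vector leaves every real token's output row untouched (a self-contained sketch under these assumptions, not BERT's actual code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q, k, v = (torch.randn(5, 8) for _ in range(3))  # 5 tokens, head size 8
mask = torch.zeros(5)
mask[4] = torch.finfo(torch.float32).min  # token 4 is padding

def attend(v_pad):
    # Hypothetical single-head attention with an additive key mask
    v2 = v.clone()
    v2[4] = v_pad
    probs = F.softmax(q @ k.T + mask, dim=-1)
    return probs @ v2

out_a = attend(torch.zeros(8))
out_b = attend(torch.full((8,), 100.0))
# Rows for the real tokens (0..3) do not depend on the padding value vector
assert torch.allclose(out_a[:4], out_b[:4])
```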

The same happens in the Add & LayerNorm and FFN layers: they do not change the padding token's scope of influence (that is, they neither add nor remove its contribution to the loss/gradient).

Lastly, for the output of the BERT model (cls, hidden): cls (1 × 768) is the first row of BERT's hidden output (512 × 768). As you thought, all elements in hidden (512 × 768) related to the padding token embeddings sit in the last padding rows ([-2:, :]), and cls is not related to the padding token embeddings. So the following pretraining heads (such as MLM or NSP), which use only cls as input (and the original embeddings, as this link says, but let's ignore that temporarily), have task losses unrelated to the padding token embeddings. That's it!