BERT attention mask question

Assume I have this code snippet:

from transformers import BertForMaskedLM, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer([
    "The capital of",
    "yes yes"], return_tensors="pt", padding=True)

pred = model(**inputs)

We get an attention_mask that looks like:

> inputs['attention_mask']
tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])

Then we step through BertForMaskedLM's forward until we reach BertSelfAttention:

if attention_mask is not None:
    # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
    attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)

Here attention_mask.shape is torch.Size([2, 1, 1, 5]), and value is:

tensor([[[[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00]]],
        [[[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -3.4028e+38]]]])
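
For reference, an additive mask with these values can be rebuilt from the 0/1 attention_mask. This is a minimal sketch of the idea, assuming float32 and a plain broadcast to [batch, 1, 1, key_len]; it is not a copy of the library's own helper.

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0]])

# Add broadcast dimensions for heads and query positions: [batch, 1, 1, key_len].
extended = attention_mask[:, None, None, :].to(torch.float32)

# Real tokens (1) become -0.0, padding tokens (0) become the most negative float32.
additive_mask = (1.0 - extended) * torch.finfo(torch.float32).min

print(additive_mask.shape)
print(additive_mask[1, 0, 0])
```

Because the mask has size 1 in the query dimension, the same row of offsets is added to every query's scores, which is why it can only zero out key columns.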

and attention_probs.shape is torch.Size([2, 12, 5, 5]). Looking at the second example:

# print(attention_probs[1][0])
tensor([[0.1480, 0.1129, 0.1008, 0.6383, 0.0000],
        [0.2509, 0.1860, 0.1909, 0.3722, 0.0000],
        [0.2001, 0.2207, 0.2237, 0.3555, 0.0000],
        [0.3047, 0.1777, 0.1707, 0.3469, 0.0000],
        [0.0713, 0.3388, 0.3918, 0.1981, 0.0000]])

The last column is 0, which is what we expect. But shouldn't the last row be 0 as well?
The real token length is 4, so it doesn't seem right to compute attention scores between an out-of-range (padding) token as the query and each token in the key. I hope someone can help explain this.

I have the same problem. Have you figured it out? :grinning:

In the statement below (x_5 means that dimension has length 5), taken from BertSelfAttention's forward:
context_layer = torch.matmul(attention_probs[batch_2, head_12, x_5, x_5], value_layer[batch_2, head_12, x_5, h_64])

Here torch.matmul acts as a batched matrix multiply (what torch.bmm does for 3D tensors), so for a single batch item and a single head the expression reduces to
prob[x_5, x_5] * value[x_5, h_64]
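
The reduction to a single 2D product can be checked on random tensors. This is only a sketch: the shapes follow the thread (batch 2, 12 heads, length 5, head size 64) and the values are random, not real BERT activations.

```python
import torch

torch.manual_seed(0)
probs = torch.softmax(torch.randn(2, 12, 5, 5), dim=-1)
value = torch.randn(2, 12, 5, 64)

# 4D matmul: one [5, 5] x [5, 64] product per (batch, head) pair.
context = torch.matmul(probs, value)

# The same product for one batch item and one head, as a plain 2D matmul.
single = probs[1, 0] @ value[1, 0]

# torch.bmm only accepts 3D inputs, so batch and head are merged first.
via_bmm = torch.bmm(probs.reshape(24, 5, 5), value.reshape(24, 5, 64))
via_bmm = via_bmm.reshape(2, 12, 5, 64)

print(context.shape)
```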

By the definition of 2D matrix multiplication, each output element is one row of probs times one column of value.

Each row of value is one token's vector, so for convenience we can look at a single row of probs and a single column of value: probs[0, x_5] * value[x_5, 0].

The problem then becomes
prob[x_5, x_5] * value[x_5], or just prob[0, x_5] * value[x_5], where probs reduces to a one-dimensional vector (the matching q*k scores are one-dimensional too).

Following the self-attention formula, each row of probs (the row that gets multiplied with a column of value) has length 5 and holds the weights that one query puts on each of the 5 value tokens. Since the last value token is padding, the weight on it, which is the last element of every row, should be zero.

Since the last element of every row is zero, the last column of the probs matrix is a zero vector.
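
A small sketch of this on random scores, assuming a single head, 5 tokens and the last token as padding (the numbers are made up, not taken from BERT):

```python
import torch

torch.manual_seed(0)
scores = torch.randn(5, 5)                      # rows = queries, columns = keys

mask = torch.zeros(5)
mask[-1] = torch.finfo(scores.dtype).min        # key 4 is the padding token

# The mask broadcasts over rows, so every query gets the same penalty on key 4.
probs = torch.softmax(scores + mask, dim=-1)

last_col = probs[:, -1]   # weight every query puts on the padding key
last_row = probs[-1]      # weights the padding query puts on every key

print(last_col)
print(last_row)
```

The last column comes out as zeros, while the last row is an ordinary distribution over the four real tokens.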

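From another perspective: each row of probs must sum to 1, but there is no such constraint on the columns.
attention_probs = nn.Softmax(dim=-1)(attention_scores): softmax(dim=-1) normalizes along the last dimension, so summing over dim=-1 gives 1 for every row, including the padding query's row. That row therefore cannot be all zeros.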
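
This can be checked directly on the matrix printed in the question. The values below are copied from that printout, so they are rounded to four decimals.

```python
import torch

# attention_probs[1][0] as printed in the question.
probs = torch.tensor([[0.1480, 0.1129, 0.1008, 0.6383, 0.0000],
                      [0.2509, 0.1860, 0.1909, 0.3722, 0.0000],
                      [0.2001, 0.2207, 0.2237, 0.3555, 0.0000],
                      [0.3047, 0.1777, 0.1707, 0.3469, 0.0000],
                      [0.0713, 0.3388, 0.3918, 0.1981, 0.0000]])

row_sums = probs.sum(dim=-1)   # one sum per query
col_sums = probs.sum(dim=-2)   # one sum per key

print(row_sums)
print(col_sums)
```

Every row sums to 1 up to rounding, while the column sums vary freely and the padding column sums to 0.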

Hope this helps.

There is another interesting observation:

context_layer = torch.matmul(attention_probs, value_layer)

As mentioned above, think of attention_layer_output = prob[x_5, x_5] * value[x_5]. The column of prob at the padding index is a zero vector, but the row of prob at the padding index generally has non-zero elements. Think of probs as a pairwise interaction matrix (not a symmetric one).

Then, by the 2D matrix multiplication formula (c_i_j = a_row_i * b_col_j), c_i_j is generally non-zero. As a result, matrix c generally has no zero rows, which means attention_layer_output has non-zero values in the padding token's row.

As a consequence, you'll see that the padding token's row in BERT's hidden states (same index along dimension -2) turns into a non-zero vector.
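
A sketch of that effect on random data, again assuming one head, 5 tokens, head size 64 and the last token as padding:

```python
import torch

torch.manual_seed(0)
seq_len, head_dim = 5, 64

scores = torch.randn(seq_len, seq_len)
mask = torch.zeros(seq_len)
mask[-1] = torch.finfo(torch.float32).min       # last key is padding
probs = torch.softmax(scores + mask, dim=-1)

value = torch.randn(seq_len, head_dim)
context = probs @ value                          # [5, 64]

# The padding query's output row is a weighted mix of the four real value rows.
padding_row = context[-1]
mix_of_real_tokens = probs[-1, :4] @ value[:4]

print(padding_row.abs().sum())
```

So the padding position does receive an output vector, but that vector is built only from real tokens' values.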

That's why people say it's not recommended to correlate input tokens with BERT's output (cls or hidden) by index in order to extract per-word embeddings from an output that is usually treated as a sentence embedding. You can do it, but it may not behave as you expect.

Alternatively, you can think of this hybrid BERT output (hidden) as contextual token embeddings, which bring in other tokens' information through the softmax-weighted attention aggregation (as my later answer says, information from the padding token is not used at all in this procedure). In that view, mapping a BERT input token index to the BERT output index (along dimension -2) does make sense.

Another finding, which may solve your problem and which explains how the attention mask blocks the padding token's embedding, both when q, k and v are generated and from one attention layer to the next:

In short, once the attention mask is applied (every query gives zero weight to the padding token's key column in q*K), both the original input (512 * 768) and the attention output (512 * 768) keep padding-related values confined to the padding token's own rows. The attention layer does not change this one-to-one row mapping for padding tokens (loosely speaking).

The same holds for the Add, LayerNorm and FFN layers: they work on each position separately, so they do not change the padding token's scope of influence (that is, they neither add nor remove its contribution to the loss and its gradient).
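
One way to see this is to change only the padding token's input and compare the outputs. The sketch below uses a hand-written single-head attention with random weights (w_q, w_k, w_v and the hidden size of 8 are made up for illustration, not BERT's real parameters).

```python
import torch

torch.manual_seed(0)
seq_len, hidden = 5, 8
w_q, w_k, w_v = (torch.randn(hidden, hidden) for _ in range(3))

mask = torch.zeros(seq_len)
mask[-1] = torch.finfo(torch.float32).min        # last token is padding

def self_attention(x):
    """Single-head scaled dot-product attention with an additive key mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / hidden ** 0.5
    probs = torch.softmax(scores + mask, dim=-1)
    return probs @ v

x = torch.randn(seq_len, hidden)
x_perturbed = x.clone()
x_perturbed[-1] = torch.randn(hidden)            # change only the padding row

out = self_attention(x)
out_perturbed = self_attention(x_perturbed)

real_rows_unchanged = torch.allclose(out[:-1], out_perturbed[:-1])
padding_row_changed = not torch.allclose(out[-1], out_perturbed[-1])
print(real_rows_unchanged, padding_row_changed)
```

The real tokens' outputs stay the same, and only the padding token's own output row moves.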

Lastly, the output of the BERT model (cls, hidden): cls (1*768) is the first row of the output hidden (512*768). As you thought, the only elements of hidden (512*768) that relate to padding token embeddings are the padding rows (the last two rows, [-2:, 768], when two tokens are padding), and cls is not related to the padding token embeddings. So a pretraining head that takes only cls as input, such as NSP (and the original embedding as this link says, but let's ignore that for now), has a task loss that is unrelated to the padding token embeddings. The MLM head reads every position of hidden, but its loss is computed only at positions that have a label, and padding positions are normally not labeled. That's it!
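
For the loss side, here is a minimal sketch with PyTorch's cross entropy, whose default ignore_index is -100. The vocabulary size of 10 and the label layout are made up for illustration; the assumption is that unlabeled positions, padding included, carry the label -100.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(5, vocab_size)                  # one row of logits per position

# Only position 1 has a label; every other position, padding included, is ignored.
labels = torch.tensor([-100, 3, -100, -100, -100])
loss = F.cross_entropy(logits, labels, ignore_index=-100)

# Change the logits at the padding position (last row) and recompute.
logits_changed = logits.clone()
logits_changed[-1] = torch.randn(vocab_size)
loss_changed = F.cross_entropy(logits_changed, labels, ignore_index=-100)

print(loss.item(), loss_changed.item())
```

Both losses are the same, so whatever ends up in the padding row of the output does not reach the loss.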