Another finding, which may solve your problem: it answers why the attention mask can block the padding-token embeddings, both during the generation of Q/K/V and through the attention layers:
In a word, after the attention mask is applied (it zeroes the score contribution of padding tokens whenever a query q attends to a padding token's column of K via q·Kᵀ), both the original input (512 * 768) and the attention output (512 * 768) keep all their padding-token-related elements in the same padding-token-related rows as before. The attention layer does not change the padding tokens' one-to-one row mapping (roughly speaking).
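Here is a minimal sketch of that masking step (toy sizes instead of 512 * 768, with the last two tokens pretending to be [PAD]; the additive mask with a very negative number is the usual trick in BERT-style implementations):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d = 6, 8                       # toy sizes instead of 512 x 768
n_pad = 2                               # pretend the last two tokens are [PAD]

x = torch.randn(seq_len, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv        # padding rows still get their own Q/K/V rows

# additive mask: 0 for real tokens, a very negative number for padding *keys*
mask = torch.zeros(seq_len)
mask[-n_pad:] = -1e9

scores = (Q @ K.T) / d ** 0.5 + mask    # broadcasts over the key (column) dimension
attn = F.softmax(scores, dim=-1)
out = attn @ V                          # still seq_len x d, same row layout as x

print(attn[:, -n_pad:].abs().max())     # ~0: no query takes anything from padding keys
```

Note that the padding rows of `out` still exist (they just read from the real tokens); that is the "one-to-one row mapping" above. What matters is that no non-padding row reads from them.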
The same happens in the Add & LayerNorm and FFN layers: they do not change the padding tokens' scope of influence (that is, they neither add nor remove the padding tokens' contribution to the loss / gradient), because they operate on each row (token) independently.
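You can see that row-wise (per-token) behaviour with a quick check (again toy sizes, and a simplified block rather than BERT's exact ordering): perturbing only the padding rows leaves every other row of the output untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d = 6, 8

ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
ln = nn.LayerNorm(d)

def add_ln_ffn(x):
    # residual add + LayerNorm + FFN: every op here acts on the last dim only,
    # i.e. on each token (row) independently -- rows never mix
    return ln(x + ffn(x))

x = torch.randn(seq_len, d)
x_perturbed = x.clone()
x_perturbed[-2:] += 100.0               # change only the (pretend) padding rows

out = add_ln_ffn(x)
out_perturbed = add_ln_ffn(x_perturbed)
print(torch.allclose(out[:-2], out_perturbed[:-2]))   # True: other rows unaffected
```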
Lastly, for the output of the BERT model (cls, hidden): cls (1 * 768) is the first row of the BERT output hidden (512 * 768). As you thought, the only elements in hidden (512 * 768) that are related to the padding-token embeddings are the last two rows (hidden[-2:, :]), and cls is not related to the padding-token embeddings. So the following pre-training heads (such as MLM or NSP), which only use cls as input (plus the original embeddings, as this link says, but let's ignore that temporarily), have a task loss that does not depend on the padding-token embeddings. That's it!
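If you want to verify the whole chain end to end, here is a hedged check assuming the Hugging Face `transformers` API (BertModel / BertTokenizer; it downloads `bert-base-uncased`): feed the token embeddings in via `inputs_embeds`, build a dummy loss from the cls row only, and inspect the gradient that flows back to the padding positions -- it should be (numerically) zero, while the real tokens get a clearly non-zero gradient.

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

enc = tok("a short sentence", padding="max_length", max_length=8, return_tensors="pt")
print(enc["attention_mask"])            # 1 = real token, 0 = [PAD]

# take the word embeddings as a leaf tensor so we can inspect their gradient
emb = model.embeddings.word_embeddings(enc["input_ids"]).detach().requires_grad_(True)

out = model(inputs_embeds=emb, attention_mask=enc["attention_mask"])
loss = out.last_hidden_state[:, 0].sum()   # dummy "loss" built from the cls row only
loss.backward()

pad = enc["attention_mask"][0] == 0
print("pad-token grad  :", emb.grad[0, pad].abs().max().item())    # ~0.0
print("real-token grad :", emb.grad[0, ~pad].abs().max().item())   # clearly non-zero
```

That (near-)zero gradient at the padding positions is exactly the "task loss not related to the padding-token embeddings" statement above.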