Another finding, which may solve your problem: it answers why the attention mask can block the padding-token embeddings, both during the generation of Q/K/V and through the attention layers:
In a word, after the attention mask is applied (it zeroes the score contribution of padding tokens whenever a query q attends to a padding token's column of K via q·Kᵀ), both the original input (512 * 768) and the attention output (512 * 768) keep all their padding-token-related elements in the same padding-token-related rows as before. The attention layer does not change the padding tokens' one-to-one row mapping (roughly speaking).
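Here is a minimal sketch of that masking step (toy sizes instead of 512 * 768, with the last two tokens pretending to be [PAD]; the additive mask with a very negative number is the usual trick in BERT-style implementations):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d = 6, 8                       # toy sizes instead of 512 x 768
n_pad = 2                               # pretend the last two tokens are [PAD]

x = torch.randn(seq_len, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv        # padding rows still get their own Q/K/V rows

# additive mask: 0 for real tokens, a very negative number for padding *keys*
mask = torch.zeros(seq_len)
mask[-n_pad:] = -1e9

scores = (Q @ K.T) / d ** 0.5 + mask    # broadcasts over the key (column) dimension
attn = F.softmax(scores, dim=-1)
out = attn @ V                          # still seq_len x d, same row layout as x

print(attn[:, -n_pad:].abs().max())     # ~0: no query takes anything from padding keys
```

Note that the padding rows of `out` still exist (they just read from the real tokens); that is the "one-to-one row mapping" above. What matters is that no non-padding row reads from them.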
The same happens in the Add & LayerNorm and FFN layers: they do not change the padding tokens' scope of influence (that is, they neither add nor remove the padding tokens' contribution to the loss / gradient), because they operate on each row (token) independently.
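You can see that row-wise (per-token) behaviour with a quick check (again toy sizes, and a simplified block rather than BERT's exact ordering): perturbing only the padding rows leaves every other row of the output untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d = 6, 8

ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
ln = nn.LayerNorm(d)

def add_ln_ffn(x):
    # residual add + LayerNorm + FFN: every op here acts on the last dim only,
    # i.e. on each token (row) independently -- rows never mix
    return ln(x + ffn(x))

x = torch.randn(seq_len, d)
x_perturbed = x.clone()
x_perturbed[-2:] += 100.0               # change only the (pretend) padding rows

out = add_ln_ffn(x)
out_perturbed = add_ln_ffn(x_perturbed)
print(torch.allclose(out[:-2], out_perturbed[:-2]))   # True: other rows unaffected
```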
Lastly, for the output of the BERT model (cls, hidden): cls (1 * 768) is the first row of the BERT output hidden (512 * 768). As you thought, the only elements in hidden (512 * 768) that are related to the padding-token embeddings are the last two rows (hidden[-2:, :]), and cls is not related to the padding-token embeddings. So the following pre-training heads (such as MLM or NSP), which only use cls as input (plus the original embeddings, as this link says, but let's ignore that temporarily), have a task loss that does not depend on the padding-token embeddings. That's it!
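If you want to verify the whole chain end to end, here is a hedged check assuming the Hugging Face `transformers` API (BertModel / BertTokenizer; it downloads `bert-base-uncased`): feed the token embeddings in via `inputs_embeds`, build a dummy loss from the cls row only, and inspect the gradient that flows back to the padding positions -- it should be (numerically) zero, while the real tokens get a clearly non-zero gradient.

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

enc = tok("a short sentence", padding="max_length", max_length=8, return_tensors="pt")
print(enc["attention_mask"])            # 1 = real token, 0 = [PAD]

# take the word embeddings as a leaf tensor so we can inspect their gradient
emb = model.embeddings.word_embeddings(enc["input_ids"]).detach().requires_grad_(True)

out = model(inputs_embeds=emb, attention_mask=enc["attention_mask"])
loss = out.last_hidden_state[:, 0].sum()   # dummy "loss" built from the cls row only
loss.backward()

pad = enc["attention_mask"][0] == 0
print("pad-token grad  :", emb.grad[0, pad].abs().max().item())    # ~0.0
print("real-token grad :", emb.grad[0, ~pad].abs().max().item())   # clearly non-zero
```

That (near-)zero gradient at the padding positions is exactly the "task loss not related to the padding-token embeddings" statement above.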