I’m looking at the BERT implementation. According to the paper, BERT employs the same encoder block as the original Transformer.
I may be missing something, but I can’t find the LayerNorm + residual connection after the multi-head attention in the Hugging Face implementation. It looks like the residual is added only after the feed-forward block. As a reference, I also point to the PyTorch code for the vanilla Transformer, in which you can clearly see the normalization + residual (and dropout) right after the multi-head attention.
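To make clear what I’m expecting, here is a minimal sketch of the post-norm encoder block from the original Transformer paper, written with standard PyTorch modules (the `EncoderBlock` class and its default sizes are just my own illustration, not taken from either codebase):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal post-norm encoder block, as described in 'Attention Is All You Need'."""

    def __init__(self, d_model=768, nhead=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Residual + LayerNorm immediately after multi-head attention --
        # this is the step I can't locate in the Hugging Face BERT code.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Second residual + LayerNorm after the feed-forward block.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x
```

In the BERT source I only see something resembling the second `norm2` step, not the first one.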
Can you confirm that such a difference exists between the BERT implementation and the original architecture, and possibly explain the reason for it?