BERT - missing layer norm and residual after attention block

I’m looking at the BERT implementation. According to the paper, BERT uses the same encoder block as the original Transformer.
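
For context, the sublayer pattern described in the original Transformer paper (post-LN) is

$$\text{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$$

where $\mathrm{Sublayer}$ is either the multi-head attention or the feed-forward network, so I would expect the residual + LayerNorm to appear after both.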

I may be missing something, but I can’t find the LayerNorm + residual connection after the multi-head attention in the Hugging Face implementation. It looks like the residual is added only after the feed-forward block. For reference, the PyTorch code for the vanilla Transformer clearly shows the normalization + residual (and dropout) after the multi-head attention, as in the sketch below.
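
To make concrete what I’m looking for, here is a minimal sketch of the encoder block layout I expect. The module names and arguments are illustrative (not the actual Hugging Face or `torch.nn.TransformerEncoderLayer` code), and the dimensions assume BERT-base:

```python
import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
    """Minimal post-LN encoder block, following the vanilla Transformer layout.
    Names and defaults here are illustrative, not the Hugging Face ones."""

    def __init__(self, d_model=768, nhead=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, nhead, dropout=dropout, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Residual + LayerNorm right after multi-head attention
        # (the part I cannot locate in the BERT implementation).
        attn_out, _ = self.self_attn(
            x, x, x, attn_mask=attn_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout1(attn_out))
        # Residual + LayerNorm after the feed-forward block
        # (this part is clearly there).
        x = self.norm2(x + self.dropout2(self.ff(x)))
        return x


block = EncoderBlockSketch()
x = torch.randn(2, 16, 768)
print(block(x).shape)  # torch.Size([2, 16, 768])
```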

Can you confirm whether such a difference actually exists in the BERT implementation, and if so, explain the reasoning behind it?