BERT - missing layer norm and residual after attention block

I’m looking at the BERT implementation. According to the paper, BERT employs the same encoder block as the original transformer.

I may be missing something, but I can’t find the LayerNorm + residual connection after the multi-head attention in the Hugging Face implementation. It looks like the residual is added only after the feed-forward block. For reference, I also point to the PyTorch code for the vanilla transformer, in which you can clearly see the normalization + residual (and dropout) after the multi-head attention.
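The pattern I mean, sketched in PyTorch (a minimal illustration of the vanilla sub-layer, not the actual library code; names are mine):

```python
import torch
import torch.nn as nn


class PostAttentionNorm(nn.Module):
    """Minimal sketch of the vanilla transformer's post-attention step:
    dropout on the attention output, residual add, then LayerNorm."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: sub-layer input; attn_out: multi-head attention output
        return self.norm(x + self.dropout(attn_out))
```

This is the step I can’t locate in the BERT code.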

Can you confirm that such a difference exists in the BERT implementation, and possibly provide some explanation for it?


Did you ever find the solution to this? I am also confused by this now.

I agree that this seems to be the case: searching for LayerNorm in the code shows that, within BertLayer (and also RobertaLayer), it’s only applied within BertOutput, i.e. after the feed-forward layer, not also between attention and feed-forward. Similarly, as far as I could see, there’s no residual connection skipping attention (although it’s harder to Ctrl+F through all the usages of +).

Has anybody found evidence in the literature, or even in blogs, that this doesn’t affect the quality of the overall model? I’ve tried looking around, but nobody ever mentions training a transformer with only one LayerNorm per transformer block, let alone omitting the residual connection. Yet Hugging Face’s implementations of BERT/RoBERTa are extremely popular and quite robust, which sounds like a strange combination?

It is defined here:
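For future readers: the post-attention residual + LayerNorm lives in BertSelfOutput, which BertAttention applies right after self-attention, so it is easy to miss if you only look at BertLayer. A simplified sketch of that module (the real code takes a config object; this version takes sizes directly):

```python
import torch
import torch.nn as nn


class BertSelfOutput(nn.Module):
    """Simplified sketch of Hugging Face's BertSelfOutput: a dense
    projection and dropout on the attention output, followed by the
    residual add and LayerNorm the thread was looking for."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        # Residual connection (input_tensor) + LayerNorm after attention.
        return self.LayerNorm(hidden_states + input_tensor)
```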

Thank you, to be honest I don’t know how I could miss that :sweat_smile: (or even how two different people could miss it independently!)

Maybe what threw me off is that, in the skip connection, there is one dense layer and one dropout, neither of which is there at that point in the original transformer model. So, I guess, Hugging Face’s implementation of BERT/RoBERTa actually has a few more layers, not fewer, than the original transformer? There’s one extra dense layer, followed by dropout, and hence also a few more weights.

A side effect is that there is no pure “highway” across a transformer layer: what gets added to the input is not the raw attention output but the output of that extra linear layer (plus dropout). It probably doesn’t matter much, it just makes it a bit harder to pass the input forward unchanged.
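To make the comparison concrete (shapes and variable names are illustrative, not taken from either codebase):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 8
x = torch.randn(2, 5, hidden)      # layer input
attn = torch.randn(2, 5, hidden)   # raw multi-head attention output
norm = nn.LayerNorm(hidden)

# Vanilla transformer sub-layer: attention output added directly.
vanilla = norm(x + attn)

# BERT-style sub-layer: a dense layer transforms the attention branch
# before the residual add (the dense inside BertSelfOutput).
dense = nn.Linear(hidden, hidden)
bert_style = norm(x + dense(attn))

print(vanilla.shape, bert_style.shape)
```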

Or am I still missing something?

Anyway, thanks for the pointer!