I’m looking at the BERT implementation. According to the paper, BERT uses the same encoder block as the original transformer.
I may be missing something, but I can’t find the LayerNorm + residual connection after the multi-head attention in the HuggingFace implementation. It looks like the residual is added only after the feed-forward block. As a reference, I also point to the PyTorch code for the vanilla transformer, in which you can clearly see the normalization + residual (and dropout) after the multi-head attention.
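To make concrete what I mean by "normalization + residual after each sublayer", here is a minimal pure-Python sketch of the pattern as I understand it from the original transformer. It is not taken from the HuggingFace or PyTorch code: `toy_attention` and `toy_feed_forward` are hypothetical stand-ins for the real sublayers, dropout is left out, and the layer norm has no learned scale or shift.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost) unit variance.

    Simplified: no learned scale/shift parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def encoder_block(x, attention, feed_forward):
    """One encoder block with residual + LayerNorm after EACH sublayer."""
    x = add_and_norm(x, attention)      # after multi-head attention
    x = add_and_norm(x, feed_forward)   # after the feed-forward block
    return x


# Hypothetical toy sublayers, only so the sketch runs end to end.
def toy_attention(x):
    return [0.5 * v for v in x]


def toy_feed_forward(x):
    return [max(0.0, v) for v in x]


out = encoder_block([1.0, 2.0, 3.0, 4.0], toy_attention, toy_feed_forward)
print(out)
```

The point is only the structure: `add_and_norm` is applied twice per block, once around attention and once around the feed-forward part. That first application is what I can’t locate in the BERT code.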
Can you confirm that the BERT implementation really differs in this way, and possibly explain why?
Did you ever find the solution to this? I am also confused by this now.
I agree that this seems to be the case - looking for `LayerNorm` in the code clearly shows that, within `BertLayer` (and also `RobertaLayer`), it’s only applied within `BertOutput`, i.e. after the feed-forward layer, not also between attention and feed-forward. Similarly, as far as I could see, there’s no residual connection skipping attention (although it’s harder to ctrl+F through all the usages of
Has anybody found evidence in the literature, or even in some blogs, that this doesn’t affect the quality of the overall model? I’ve tried looking around for information, but nobody ever mentions training a transformer with only one `LayerNorm` per transformer block - let alone omitting the residual connection. Yet HuggingFace’s implementations of BERT/RoBERTa are extremely popular and quite robust, which sounds like a weird combination?