BERT - missing layer norm and residual after attention block

I’m looking at the BERT implementation. According to the paper, BERT employs the same encoder block as the original transformer.

I may be missing something, but I can’t find the LayerNorm + residual connection after the multi-head attention in the Hugging Face implementation. It looks like the residual is added only after the feed-forward block. For reference, I also point to the PyTorch code for the vanilla transformer, in which you can clearly see the normalization + residual (and dropout) after the multi-head attention.
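The pattern I mean, sketched in PyTorch (a minimal illustration of the vanilla sub-layer, not the actual library code; names are mine):

```python
import torch
import torch.nn as nn


class PostAttentionNorm(nn.Module):
    """Minimal sketch of the vanilla transformer's post-attention step:
    dropout on the attention output, residual add, then LayerNorm."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: sub-layer input; attn_out: multi-head attention output
        return self.norm(x + self.dropout(attn_out))
```

This is the step I can’t locate in the BERT code.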

Can you confirm that such a difference exists in the BERT implementation, and possibly provide some explanation for it?


Did you ever find the solution to this? I am also confused by this now.

I agree that this seems to be the case: searching for LayerNorm in the code shows that, within BertLayer (and also RobertaLayer), it’s only applied within BertOutput, i.e. after the feed-forward layer, not also between attention and feed-forward. Similarly, as far as I could see, there’s no residual connection skipping attention (although it’s harder to Ctrl+F through all the usages of +).

Has anybody found evidence in the literature, or even in blogs, that this doesn’t affect the quality of the overall model? I’ve tried looking around, but nobody ever mentions training a transformer with only one LayerNorm per transformer block, let alone omitting the residual connection. Yet Hugging Face’s implementations of BERT/RoBERTa are extremely popular and quite robust, which sounds like a strange combination?

It is defined here:
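For future readers: the post-attention residual + LayerNorm lives in BertSelfOutput, which BertAttention applies right after self-attention, so it is easy to miss if you only look at BertLayer. A simplified sketch of that module (the real code takes a config object; this version takes sizes directly):

```python
import torch
import torch.nn as nn


class BertSelfOutput(nn.Module):
    """Simplified sketch of Hugging Face's BertSelfOutput: a dense
    projection and dropout on the attention output, followed by the
    residual add and LayerNorm the thread was looking for."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        # Residual connection (input_tensor) + LayerNorm after attention.
        return self.LayerNorm(hidden_states + input_tensor)
```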

Thank you, to be honest I don’t know how I could miss that :sweat_smile: (or even how two different people could miss it independently!)

Maybe what threw me off is that, in the skip connection, there is one dense layer and one dropout, neither of which is there at that point in the original transformer model. So, I guess, Hugging Face’s implementation of BERT/RoBERTa actually has a few more layers, not fewer, than the original transformer? There’s one extra dense layer, followed by dropout, and hence also a few more weights.

A side effect is that there is no pure “highway” across a transformer layer: what gets added to the input is not the raw attention output but the output of that extra linear layer (plus dropout). It probably doesn’t matter much, it just makes it a bit harder to pass the input forward unchanged.
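To make the comparison concrete (shapes and variable names are illustrative, not taken from either codebase):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 8
x = torch.randn(2, 5, hidden)      # layer input
attn = torch.randn(2, 5, hidden)   # raw multi-head attention output
norm = nn.LayerNorm(hidden)

# Vanilla transformer sub-layer: attention output added directly.
vanilla = norm(x + attn)

# BERT-style sub-layer: a dense layer transforms the attention branch
# before the residual add (the dense inside BertSelfOutput).
dense = nn.Linear(hidden, hidden)
bert_style = norm(x + dense(attn))

print(vanilla.shape, bert_style.shape)
```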

Or am I still missing something?

Anyway, thanks for the pointer!