BERT - missing LayerNorm and residual after attention block

I’m looking at the BERT implementation. According to the paper, BERT uses the same encoder block as the original Transformer.

I may be missing something, but I can’t find the LayerNorm + residual connection after the multi-head attention in the Hugging Face implementation. It looks like the residual is added only after the feed-forward block. As a reference, I also point to the PyTorch code for the vanilla Transformer, where you can clearly see the normalization + residual (and dropout) after the multi-head attention.
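To be concrete, this is the post-LN structure I would expect in each encoder layer; the sketch below is my own paraphrase (not the actual PyTorch source, names like `PostLNEncoderLayer` are mine), with residual + LayerNorm both after attention and after the feed-forward:

```python
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """Rough sketch of a vanilla post-LN Transformer encoder layer (Vaswani et al.)."""
    def __init__(self, d_model=768, nhead=12, dim_ff=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))    # residual + LayerNorm after attention
        x = self.norm2(x + self.dropout(self.ff(x)))  # residual + LayerNorm after feed-forward
        return x
```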

Can you confirm that such a difference exists in the BERT implementation and, if so, possibly explain the reason for it?


Did you ever find the solution to this? I am also confused by this now.

I agree that this seems to be the case: searching for LayerNorm in the code shows that, within BertLayer (and also RobertaLayer), it is only applied within BertOutput, i.e. after the feed-forward layer, not between the attention and feed-forward sub-layers. Similarly, as far as I could see, there’s no residual connection skipping attention (although it’s harder to ctrl+F through all the usages of +).

Has anybody found evidence in the literature, or even in blog posts, that this doesn’t affect the quality of the overall model? I’ve tried looking around for information, but nobody ever mentions training a Transformer with only one LayerNorm per block, let alone omitting the residual connection. Yet Hugging Face’s implementations of BERT/RoBERTa are extremely popular and quite robust, which sounds like a weird combination?

It is defined here: https://github.com/huggingface/transformers/blob/ede051f1b85a33d2e0576b48042a58dc5332ed70/src/transformers/models/bert/modeling_bert.py#L378C33-L378C33
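That line sits inside BertSelfOutput, which handles the add & norm right after self-attention. Paraphrasing from memory (exact details may differ across versions), it looks roughly like this:

```python
import torch.nn as nn

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        # hidden_states: self-attention output; input_tensor: the residual branch
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)  # residual + LayerNorm
        return hidden_states
```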

Thank you, to be honest I don’t know how I could miss that :sweat_smile: (or even how two different people could miss it independently!)

Maybe what put me off is that, in that skip-connection block, there is a dense layer and a dropout, neither of which is actually there in the original Transformer model. So, I guess, Hugging Face’s implementation of BERT/RoBERTa actually has a few layers more, not fewer, than the original Transformer: one extra dense layer, followed by dropout, and hence also a few more weights.

A side effect is also that there is no “highway” across any transformer layer: the attention output is not summed with the unchanged input, but rather with the output of another linear layer. It probably doesn’t matter much; it just makes it a bit harder to pass the input forward unchanged.

Or am I still missing something?

Anyway, thanks for the pointer!