BertSelfAttention, BertSelfOutput implementation

Hi, I’m studying BERT and am curious about the implementation of BertSelfAttention and BertSelfOutput.

According to the Transformer paper (Attention Is All You Need, Vaswani et al., 2017), multi-head self-attention requires four Linear layers: one for the query, one for the key, one for the value, and one for the final output projection.
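For reference, the paper defines it as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V)$$

where the per-head projections $W_i^Q, W_i^K, W_i^V$ correspond to the three shared query/key/value Linear layers and $W^O$ is the fourth, output Linear layer.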

However, BertSelfAttention has only three Linear layers, for the query, key, and value. The Linear layer for the final output projection lives in BertSelfOutput instead.
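To make the question concrete, here is a simplified, single-head sketch of how the two modules split the four layers. This is my own simplification, not the actual Hugging Face code: the real modules also reshape into multiple heads and handle attention masks and dropout.

```python
import torch
import torch.nn as nn

class BertSelfAttentionSketch(nn.Module):
    """Only the three input projections live here."""
    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states):
        q = self.query(hidden_states)
        k = self.key(hidden_states)
        v = self.value(hidden_states)
        scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
        probs = scores.softmax(dim=-1)
        return probs @ v  # context vectors, no output projection yet

class BertSelfOutputSketch(nn.Module):
    """The fourth Linear layer (W^O), plus residual connection and LayerNorm."""
    def __init__(self, hidden_size):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        return self.LayerNorm(hidden_states + input_tensor)

x = torch.randn(2, 8, 16)            # (batch, seq_len, hidden)
attn = BertSelfAttentionSketch(16)
out = BertSelfOutputSketch(16)
y = out(attn(x), x)                   # residual uses the module input, as in BERT
```

So mathematically the two modules together are the standard multi-head attention; the split just moves the output projection (and the residual + LayerNorm) into a separate class.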

Is there any reason why they are implemented like this?

I think this was a design decision by @thomwolf.

It was done like this in the original TensorFlow implementation of BERT and can probably be traced back to the tensor2tensor library.

It depends on the model and its original implementation. DistilBERT, for example, applies the fourth linear transformation inside its attention module.
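For contrast, here is a rough single-head sketch of that layout. The attribute names follow DistilBERT's MultiHeadSelfAttention, but this is a simplification: the real module also splits heads and handles masking and dropout.

```python
import torch.nn as nn

class MultiHeadSelfAttentionSketch(nn.Module):
    """All four Linear layers, including the output projection, in one module."""
    def __init__(self, dim):
        super().__init__()
        self.q_lin = nn.Linear(dim, dim)
        self.k_lin = nn.Linear(dim, dim)
        self.v_lin = nn.Linear(dim, dim)
        self.out_lin = nn.Linear(dim, dim)  # fourth Linear, kept inside the module

    def forward(self, x):
        q, k, v = self.q_lin(x), self.k_lin(x), self.v_lin(x)
        scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
        context = scores.softmax(dim=-1) @ v
        return self.out_lin(context)  # output projection applied here
```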

Thank you all for your replies :-)