Hi, I’m studying BERT and am curious about the implementation of BertSelfAttention and BertSelfOutput.
According to the Transformer paper (Attention Is All You Need, Vaswani et al., 2017), multi-head self-attention requires four Linear layers: one for the query, one for the key, one for the value, and one for the final output projection.
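For reference, here is a rough sketch of what I mean by "four Linear layers" (my own simplified code, not from any library; single head shown for brevity):

```python
import torch
import torch.nn as nn

class PaperStyleSelfAttention(nn.Module):
    """Sketch of the paper's formulation: all four projections in one module."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.W_q = nn.Linear(hidden_size, hidden_size)  # query projection
        self.W_k = nn.Linear(hidden_size, hidden_size)  # key projection
        self.W_v = nn.Linear(hidden_size, hidden_size)  # value projection
        self.W_o = nn.Linear(hidden_size, hidden_size)  # final output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # scaled dot-product attention (head splitting omitted for brevity)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5), dim=-1)
        return self.W_o(attn @ v)
```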
However, BertSelfAttention only has three Linear layers, for the query, key, and value. The Linear layer for the final output is in BertSelfOutput.
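In other words, the split looks roughly like this (again my own simplified sketch mirroring the structure I see, not the actual Hugging Face code; dropout and the multi-head reshaping are omitted):

```python
import torch
import torch.nn as nn

class MySelfAttention(nn.Module):
    """Roughly corresponds to BertSelfAttention: only Q, K, V projections."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5), dim=-1)
        return attn @ v  # note: no output projection here

class MySelfOutput(nn.Module):
    """Roughly corresponds to BertSelfOutput: the fourth Linear lives here."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # final output projection
        self.LayerNorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        # output projection + residual connection + LayerNorm
        return self.LayerNorm(self.dense(hidden_states) + input_tensor)
```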
Is there any reason why they are implemented like this?