BertSelfAttention, BertSelfOutput implementation

Hi, I’m studying BERT and am curious about the implementation of BertSelfAttention and BertSelfOutput.

According to the Transformer paper (Attention Is All You Need, Vaswani et al., 2017), multi-head self-attention requires four Linear layers: one for the query, one for the key, one for the value, and one for the final output projection.
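For reference, the paper defines it as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V)$$

where the per-head projections $W_i^Q, W_i^K, W_i^V$ correspond to the three shared query/key/value Linear layers and $W^O$ is the fourth, output Linear layer.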

However, BertSelfAttention has only three Linear layers, for the query, key, and value. The Linear layer for the final output projection lives in BertSelfOutput instead.
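To make the question concrete, here is a simplified, single-head sketch of how the two modules split the four layers. This is my own simplification, not the actual Hugging Face code: the real modules also reshape into multiple heads and handle attention masks and dropout.

```python
import torch
import torch.nn as nn

class BertSelfAttentionSketch(nn.Module):
    """Only the three input projections live here."""
    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states):
        q = self.query(hidden_states)
        k = self.key(hidden_states)
        v = self.value(hidden_states)
        scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
        probs = scores.softmax(dim=-1)
        return probs @ v  # context vectors, no output projection yet

class BertSelfOutputSketch(nn.Module):
    """The fourth Linear layer (W^O), plus residual connection and LayerNorm."""
    def __init__(self, hidden_size):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        return self.LayerNorm(hidden_states + input_tensor)

x = torch.randn(2, 8, 16)            # (batch, seq_len, hidden)
attn = BertSelfAttentionSketch(16)
out = BertSelfOutputSketch(16)
y = out(attn(x), x)                   # residual uses the module input, as in BERT
```

So mathematically the two modules together are the standard multi-head attention; the split just moves the output projection (and the residual + LayerNorm) into a separate class.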

Is there any reason why they are implemented like this?

I think this was a design decision by @thomwolf.

It was done like this in the original TensorFlow implementation of BERT and can probably be traced back to the tensor2tensor library.

It depends on the model and its original implementation. DistilBERT, for example, applies the fourth linear transformation inside its attention module.
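For contrast, here is a rough single-head sketch of that layout. The attribute names follow DistilBERT's MultiHeadSelfAttention, but this is a simplification: the real module also splits heads and handles masking and dropout.

```python
import torch.nn as nn

class MultiHeadSelfAttentionSketch(nn.Module):
    """All four Linear layers, including the output projection, in one module."""
    def __init__(self, dim):
        super().__init__()
        self.q_lin = nn.Linear(dim, dim)
        self.k_lin = nn.Linear(dim, dim)
        self.v_lin = nn.Linear(dim, dim)
        self.out_lin = nn.Linear(dim, dim)  # fourth Linear, kept inside the module

    def forward(self, x):
        q, k, v = self.q_lin(x), self.k_lin(x), self.v_lin(x)
        scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
        context = scores.softmax(dim=-1) @ v
        return self.out_lin(context)  # output projection applied here
```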

Thank you all for your replies :-)