I have a question about the sizes of the query, key and value matrices. Based on this paper, and also on the demonstration in this Medium post, I was expecting the query, key and value matrices to have the shape [seq_length x seq_length]. But when I print the sizes of the parameters as below, I see those weights as [768 x 768].
for name, param in model.named_parameters():
    print(name, param.size())
>>> bert.bert.encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
I am really confused and feel like I am missing something. Could someone please help me figure it out?
In my view, this illustrates that the weights for Q, K and V in a single head are actually 768 x 64 when the number of heads is 12. Due to multi-head attention, these per-head projections are concatenated to again produce the [768 x 768] shape you have observed. The [seq_length x seq_length] shape you were expecting belongs to the attention score matrix softmax(QK^T / sqrt(d_k)), which depends on the input length, not to these learned weight matrices.
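To make the shapes concrete, here is a minimal PyTorch sketch. The layout mirrors BERT-base (hidden_size = 768, 12 heads), but the seq_length of 16 and the freshly initialized projection layers are hypothetical stand-ins, not the actual BERT weights:

import torch

hidden_size, num_heads, seq_length = 768, 12, 16  # seq_length chosen arbitrarily for illustration
head_dim = hidden_size // num_heads               # 768 / 12 = 64

# The learned projections are [hidden_size, hidden_size], matching the printed parameter sizes.
W_q = torch.nn.Linear(hidden_size, hidden_size, bias=False)
W_k = torch.nn.Linear(hidden_size, hidden_size, bias=False)
W_v = torch.nn.Linear(hidden_size, hidden_size, bias=False)

x = torch.randn(seq_length, hidden_size)          # one sequence of token embeddings

# Project, then split the last dimension into 12 heads of 64 each:
# each head effectively uses a [768 x 64] slice of the weight matrix.
q = W_q(x).view(seq_length, num_heads, head_dim).transpose(0, 1)  # [12, seq_length, 64]
k = W_k(x).view(seq_length, num_heads, head_dim).transpose(0, 1)  # [12, seq_length, 64]
v = W_v(x).view(seq_length, num_heads, head_dim).transpose(0, 1)  # [12, seq_length, 64]

# The [seq_length x seq_length] shape appears here: the per-head attention scores.
scores = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
print(scores.shape)                               # torch.Size([12, 16, 16])

# The per-head outputs are concatenated back to [seq_length, hidden_size].
out = (scores @ v).transpose(0, 1).reshape(seq_length, hidden_size)
print(out.shape)                                  # torch.Size([16, 768])

So the [768 x 768] tensors you printed are input-independent model parameters, while the [seq_length x seq_length] matrix only exists at run time, once a concrete input sequence is attended over.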