Sizes of query, key and value vectors in the BERT model

I have a question about the sizes of the query, key and value vectors. As mentioned in this paper and also demonstrated in this Medium article, I was expecting the query, key and value vectors to have shape [seq_length x seq_length]. But when I print the parameter sizes as shown below, I see shapes of [768 x 768] instead.

    for name, param in model.named_parameters():
        print(name, param.size())

    bert.bert.encoder.layer.0.attention.self.query.weight  torch.Size([768, 768])
    bert.bert.encoder.layer.0.attention.self.key.weight    torch.Size([768, 768])
    bert.bert.encoder.layer.0.attention.self.value.weight  torch.Size([768, 768])

I am really confused. I feel like I am missing something. Could someone please help me figure it out?

You are looking at the weights of the query/key/value projections, not at the query, key and value vectors themselves.

Oh, I see. Would you mind explaining how the vectors are derived from these weights, or pointing me to any references for it?

Thank you

If you dive into the source code of class BertSelfAttention(nn.Module), you will find:

    # 12 heads for BERT-base
    self.num_attention_heads = config.num_attention_heads
    # 768 / 12 = 64 dimensions per head
    self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
    # 12 * 64 = 768
    self.all_head_size = self.num_attention_heads * self.attention_head_size

    # each projection maps hidden_size (768) to all_head_size (768)
    self.query = nn.Linear(config.hidden_size, self.all_head_size)
    self.key = nn.Linear(config.hidden_size, self.all_head_size)
    self.value = nn.Linear(config.hidden_size, self.all_head_size)

In my view, this shows that the per-head Q, K and V projections are actually 768 x 64 when the number of heads is 12. Because of multi-head attention, these 12 heads are concatenated, which again produces the 768 x 768 shape you observed.
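Here is a minimal sketch of the idea, not the actual Hugging Face implementation, using assumed BERT-base numbers (hidden_size = 768, 12 heads, and an arbitrary sequence length of 10). It shows how the query and key vectors come out of those [768, 768] weights and where the [seq_length x seq_length] shape actually appears:

    import torch
    import torch.nn as nn

    # Assumed BERT-base numbers, chosen to match the shapes printed above
    hidden_size = 768
    num_heads = 12
    head_size = hidden_size // num_heads      # 64
    seq_length = 10                           # arbitrary example length

    # Stand-ins for one layer's learned projections (the [768, 768] weights you saw)
    query = nn.Linear(hidden_size, num_heads * head_size)
    key = nn.Linear(hidden_size, num_heads * head_size)

    hidden_states = torch.randn(1, seq_length, hidden_size)  # [batch, seq_len, hidden]

    # The query/key vectors are projections of the hidden states, reshaped so
    # each of the 12 heads gets its own 64-dimensional slice
    def split_heads(x):
        return x.view(1, seq_length, num_heads, head_size).permute(0, 2, 1, 3)

    q = split_heads(query(hidden_states))     # [1, 12, 10, 64]
    k = split_heads(key(hidden_states))       # [1, 12, 10, 64]

    # The [seq_length x seq_length] matrix from the paper is the attention-score
    # matrix each head computes, not the weight matrix itself
    scores = q @ k.transpose(-1, -2) / head_size ** 0.5
    print(q.shape)        # torch.Size([1, 12, 10, 64])
    print(scores.shape)   # torch.Size([1, 12, 10, 10])

So the [768, 768] matrices are the learned projections, each head works with a 64-dimensional slice of the projected vectors, and the [seq_length x seq_length] matrix you were expecting is the attention-score matrix each head computes from them.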