I have a question about the sizes of the query, key and value matrices. Based on this paper, and also on the demonstration in this Medium post, I was expecting the query, key and value matrices to have the shape [seq_length x seq_length]. But when I print the sizes of the parameters as below, I see those weights as [768 x 768].
for name, param in model.named_parameters():
    print(name, param.size())
>>> bert.bert.encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
I am really confused and feel like I am missing something. Could someone please help me figure it out?
In my view, this illustrates that the weights for Q, K and V in a single head are actually 768 x 64 when the number of heads is 12. Due to multi-head attention, these per-head projections are concatenated to again produce the [768 x 768] shape you have observed. The [seq_length x seq_length] shape you were expecting belongs to the attention score matrix softmax(QK^T / sqrt(d_k)), which depends on the input length, not to these learned weight matrices.
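To make the shapes concrete, here is a minimal PyTorch sketch. The layout mirrors BERT-base (hidden_size = 768, 12 heads), but the seq_length of 16 and the freshly initialized projection layers are hypothetical stand-ins, not the actual BERT weights:

import torch

hidden_size, num_heads, seq_length = 768, 12, 16  # seq_length chosen arbitrarily for illustration
head_dim = hidden_size // num_heads               # 768 / 12 = 64

# The learned projections are [hidden_size, hidden_size], matching the printed parameter sizes.
W_q = torch.nn.Linear(hidden_size, hidden_size, bias=False)
W_k = torch.nn.Linear(hidden_size, hidden_size, bias=False)
W_v = torch.nn.Linear(hidden_size, hidden_size, bias=False)

x = torch.randn(seq_length, hidden_size)          # one sequence of token embeddings

# Project, then split the last dimension into 12 heads of 64 each:
# each head effectively uses a [768 x 64] slice of the weight matrix.
q = W_q(x).view(seq_length, num_heads, head_dim).transpose(0, 1)  # [12, seq_length, 64]
k = W_k(x).view(seq_length, num_heads, head_dim).transpose(0, 1)  # [12, seq_length, 64]
v = W_v(x).view(seq_length, num_heads, head_dim).transpose(0, 1)  # [12, seq_length, 64]

# The [seq_length x seq_length] shape appears here: the per-head attention scores.
scores = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
print(scores.shape)                               # torch.Size([12, 16, 16])

# The per-head outputs are concatenated back to [seq_length, hidden_size].
out = (scores @ v).transpose(0, 1).reshape(seq_length, hidden_size)
print(out.shape)                                  # torch.Size([16, 768])

So the [768 x 768] tensors you printed are input-independent model parameters, while the [seq_length x seq_length] matrix only exists at run time, once a concrete input sequence is attended over.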