I am trying to analyze the attention data from a Reformer model. I use the following settings:
I expect attention (output) shape to be:
(batch_size, number_of_heads, number_of_chunks, 64, 64)
The result (shape of attention for each layer) I get is:
torch.Size([1, 2, 41, 64, 128])
The first four dimensions match what I expect, but the last dimension is 128 instead of 64.
How do I interpret the last dimension (128)?
The output feature size of the attention layer is 128. Does the last dimension refer to that output?
(query_key): Linear(in_features=256, out_features=128, bias=False)
If so, what about the 64 x 64 I expected for the attention within each chunk?
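
One possible reading of the 128, sketched here as shape arithmetic only. It assumes this is the Hugging Face `transformers` Reformer with a chunk length of 64 and its documented defaults `lsh_num_chunks_before=1` and `lsh_num_chunks_after=0` (the actual settings are not shown above, so these are assumptions). Under those assumptions each chunk of 64 queries attends to the keys of its own chunk plus the previous chunk, which would give 64 x 128 rather than 64 x 64. The helper `expected_attention_shape` is hypothetical and is not part of any library.

```python
def expected_attention_shape(batch_size, num_heads, num_chunks, chunk_length,
                             num_chunks_before=1, num_chunks_after=0):
    """Hypothetical helper: shape of chunked attention weights.

    Assumption: every query chunk sees the keys of its own chunk plus
    `num_chunks_before` earlier and `num_chunks_after` later chunks.
    """
    # Number of key positions visible to the queries of one chunk.
    keys_per_chunk = chunk_length * (1 + num_chunks_before + num_chunks_after)
    return (batch_size, num_heads, num_chunks, chunk_length, keys_per_chunk)


# Values taken from the shape reported in the question; the chunk settings are assumed.
print(expected_attention_shape(1, 2, 41, 64))                       # (1, 2, 41, 64, 128)
# With no neighbouring chunks, the last dimension would be 64.
print(expected_attention_shape(1, 2, 41, 64, num_chunks_before=0))  # (1, 2, 41, 64, 64)
```

If this reading applies, the 128 would be unrelated to `out_features=128` of the `query_key` projection and would only count key positions per query chunk.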