Hi,
When using the chunked self-attention layer in Reformer, the attention weight matrix has a different shape than with global self-attention. The documentation doesn't say anything about this, so I dug into the code to understand why; it seems to be related to the chunking mechanism.
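For reference, this is roughly how I get the attention weights (a minimal sketch with made-up config values, not my real setup):

```python
import torch
from transformers import ReformerConfig, ReformerModel

# Toy config using only local (chunked) self-attention; the values are illustrative.
config = ReformerConfig(
    attn_layers=["local"],
    axial_pos_embds=False,        # plain learned position embeddings for this toy example
    local_attn_chunk_length=16,
    local_num_chunks_before=1,
    local_num_chunks_after=0,
)
model = ReformerModel(config)

input_ids = torch.randint(0, config.vocab_size, (1, 64))
outputs = model(input_ids, output_attentions=True)

# Chunked attention weights of the first (and only) layer
print(outputs.attentions[0].shape)
```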
However, I am struggling to recover the equivalent attention weight matrix that a classical global attention layer would produce.
Does anyone have an idea how to do this?
Global attention: attention weight shape (batch_size, num_heads, sequence_length, sequence_length)
Chunked attention: attention weight shape (batch_size, num_heads, num_chunks, attn_chunk_length, attn_chunk_length * (1 + num_chunks_before + num_chunks_after))
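To make it concrete, here is the kind of reconstruction I have been trying. It assumes the chunked weights can be viewed as (batch_size, num_heads, num_chunks, attn_chunk_length, attn_chunk_length * (1 + num_chunks_before + num_chunks_after)), and that the key window of chunk i is the concatenation of chunks i - num_chunks_before, ..., i, ..., i + num_chunks_after with circular wrap-around, which is how I read _look_adjacent in modeling_reformer.py, but I may well be misreading it:

```python
import torch

def unchunk_attention(chunked, num_chunks_before, num_chunks_after):
    """Naive attempt to scatter chunked attention weights back onto a dense
    (batch, num_heads, seq_len, seq_len) matrix.

    Assumes `chunked` has shape
    (batch, num_heads, num_chunks, chunk_len, chunk_len * (1 + num_chunks_before + num_chunks_after))
    and that the key window of query chunk i covers chunks
    i - num_chunks_before, ..., i, ..., i + num_chunks_after, wrapping around the sequence.
    """
    batch, heads, num_chunks, chunk_len, _ = chunked.shape
    seq_len = num_chunks * chunk_len
    full = torch.zeros(batch, heads, seq_len, seq_len,
                       dtype=chunked.dtype, device=chunked.device)

    for i in range(num_chunks):
        q_start = i * chunk_len
        # Chunk indices that (I assume) make up the key window of query chunk i
        key_chunks = [(i + off) % num_chunks
                      for off in range(-num_chunks_before, num_chunks_after + 1)]
        for w, j in enumerate(key_chunks):
            k_start = j * chunk_len
            full[:, :, q_start:q_start + chunk_len, k_start:k_start + chunk_len] = \
                chunked[:, :, i, :, w * chunk_len:(w + 1) * chunk_len]
    return full
```

Positions outside a query's window are left at zero. I am not sure this handles the wrap-around of the first chunks, the attention mask, or padding to a multiple of the chunk length correctly, hence my question.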
Thanks