(Sorry in advance: as a new user I can only post two links to the code, so I've kept only the main ones.)
My question is about the rotary embeddings (RoPE) in GPT-J. I think (or at least thought) that I understand how they are supposed to work, but I still can't figure out exactly how they are applied when the model is run with cached past keys/values.
To explain a bit more clearly, quoting lines from the code:
- At lines 211-217, we project the hidden states into Q, K and V
- At lines 226-236, we apply RoPE to the queries and keys
- At lines 247-251, we concatenate the past layer's keys/values with the current keys/values, so that the current queries can also attend to the past layer
But in the third step, it seems that the past layer's keys don't have RoPE applied, and I struggle to see where the (relative) position information comes from. How does the current query know its distance from the tokens in the past layer?
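To make the question concrete, here is a minimal pure-Python sketch (not the GPT-J code) of what I would expect to happen if the cached keys had already been rotated at their own positions before being stored. The helper names `rope` and `dot`, the toy vectors, and the positions are all made up for illustration, and the assumption that the cache holds post-RoPE keys is exactly the thing I am unsure about.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional query and key (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Earlier decoding step: the token at position 2 gets its key rotated, then cached.
# ASSUMPTION: the cache stores the key AFTER the rotation.
cache = [rope(k, 2)]

# Current step: the new token sits at position 5; only its query is rotated now.
score_cached = dot(rope(q, 5), cache[0])

# Same offset (3) at different absolute positions, computed without any cache.
score_shifted = dot(rope(q, 13), rope(k, 10))

# What the score would look like if the cached key had NOT been rotated.
score_unrotated_cache = dot(rope(q, 5), k)

print(math.isclose(score_cached, score_shifted))
print(math.isclose(score_cached, score_unrotated_cache))
```

If that assumption holds, the cached score depends only on the offset between the two positions, so nothing would need to be re-applied to the past keys at concatenation time; with an unrotated cache, the score would instead depend on the query's absolute position.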
Thanks all, looking forward to discussing