Positional embedding in GPT-J when using `past_layer`

Hi all,

(Sorry in advance: as a new user I can only post two links to the code, so I’ve kept only the main ones.)

My question is about the rotary embeddings (RoPE) in GPT-J. I think (or at least thought) that I understand how they are supposed to work, but I still can’t figure out exactly how they are used when the model is run with a past layer, i.e. with cached keys and values from earlier tokens.

To explain a bit more clearly, quoting lines from the code:

  1. At lines 211-217 we project the hidden states into Q, K, V
  2. At lines 226-236 we apply RoPE
  3. At lines 247-251 we concatenate the past layer’s keys/values with the current keys/values, so that the current queries can also attend to the past layer

But at step 3, it seems that the past layer’s keys don’t have RoPE applied, and I struggle to understand where the (relative) position information comes from. How does the current query know its distance from the tokens in the past layer?
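To make the question concrete, here is a minimal NumPy sketch of the three steps as I read them. This is not the library code: `rope`, `w_q`, `w_k`, `cached_k` and the interleaved pairing of features are my own illustrative choices, and the sketch assumes the cached keys were rotated at their own positions before being stored. Whether the real implementation does that is exactly what I am unsure about.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) feature pair of x by an angle that grows with position.

    x: (seq_len, dim) array, positions: (seq_len,) integer array.
    """
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(positions, inv_freq)                    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
dim, n_past = 8, 5
hidden = rng.normal(size=(n_past + 1, dim))  # 5 past tokens + 1 current token
w_q = rng.normal(size=(dim, dim))            # hypothetical projection weights
w_k = rng.normal(size=(dim, dim))

# Reference: process the whole sequence at once (no cache).
pos = np.arange(n_past + 1)
q_full = rope(hidden @ w_q, pos)
k_full = rope(hidden @ w_k, pos)
scores_full = q_full[-1] @ k_full.T          # scores of the last query

# Incremental: ASSUMPTION, the cache holds keys that were already rotated.
cached_k = rope(hidden[:n_past] @ w_k, np.arange(n_past))

new_pos = np.array([n_past])                 # absolute position of the new token
q_new = rope(hidden[-1:] @ w_q, new_pos)     # steps 1 and 2 for the new token
k_new = rope(hidden[-1:] @ w_k, new_pos)
k_all = np.concatenate([cached_k, k_new], axis=0)  # step 3
scores_cached = (q_new @ k_all.T)[0]

same_scores = np.allclose(scores_full, scores_cached)

# Shifting every position by the same offset leaves the scores unchanged,
# i.e. the scores depend only on relative distance.
scores_shifted = (rope(hidden[-1:] @ w_q, new_pos + 7)
                  @ rope(hidden @ w_k, pos + 7).T)[0]
relative_only = np.allclose(scores_full, scores_shifted)

print(same_scores, relative_only)
```

Under that assumption the cached and uncached scores agree, so the new query would only need its own absolute position. If the cache instead held unrotated keys, I don’t see how the distance could be recovered, hence the question.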

Thanks all, looking forward to discussing