Positional embedding in GPT-J when using `past_layer`

DocKaiju · January 13, 2023, 6:34pm

Hi all,

(Sorry in advance, as a new user I can only post two links in the code, so I’ve kept only the main ones)

My question is about the rotary embeddings (RoPE) in GPT-J. I think (or at least thought) that I understand how they are supposed to work, but I still can’t figure out exactly how they are applied when the model is run with cached past keys/values.

To explain a bit more clearly, quoting lines from the code:

1. At lines 211-217 we do the projection into Q, K, V.
2. At lines 226-236 we add RoPE.
3. At lines 247-251 we concatenate the past layer’s keys/values with the current keys/values, so that the attention using the queries can also query the past layer.

But at step 3, it seems that the past layer’s keys don’t have RoPE added, and I struggle to understand where their (relative) position information comes from. How does the current query know its distance from tokens in the past layer?

Thanks all, looking forward to discussing
Clément
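
One possible reading, sketched as a toy in plain Python: if the cache stores keys after RoPE has been applied (an assumption worth checking against the linked lines, not something confirmed in this post), then the past keys already carry their positions, and only the new token needs rotating, at a position equal to the cache length. The names `rope`, `cache`, and `offset` below are hypothetical and do not come from the GPT-J source.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by an angle proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # one frequency per pair of dims
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_scores(query, rotated_keys):
    """Unnormalised attention scores of one query against a list of keys."""
    return [dot(query, k) for k in rotated_keys]

# Toy (unrotated) keys for positions 0..3 and a query for position 3.
raw_keys = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, -0.1, 0.2, 0.0],
    [-0.3, 0.7, 0.1, 0.2],
    [0.4, 0.4, -0.2, 0.6],
]
raw_query = [0.3, -0.2, 0.5, 0.1]

# Earlier pass: keys for positions 0..2 are rotated, THEN cached (assumption).
cache = [rope(k, pos) for pos, k in enumerate(raw_keys[:3])]

# Incremental step: the new token sits at position `offset` = cache length.
offset = len(cache)
q_new = rope(raw_query, offset)
k_new = rope(raw_keys[3], offset)
incremental = attend_scores(q_new, cache + [k_new])  # concat, no re-rotation

# Reference: recompute everything in a single pass without a cache.
full = attend_scores(
    rope(raw_query, 3), [rope(k, p) for p, k in enumerate(raw_keys)]
)

# Relative property: shifting query and key by the same amount keeps the score.
shift = 10
shifted = dot(rope(raw_query, 3 + shift), rope(raw_keys[1], 1 + shift))

print(max(abs(a - b) for a, b in zip(incremental, full)))
print(abs(shifted - incremental[1]))
```

Under that assumption the concatenation in step 3 would not need to touch the past keys at all: the score between a rotated query and a rotated key depends only on the difference of their positions, so the distance information is already baked into the cached tensors.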
