Isn't the KV cache affected by position encoding during inference?

The KV cache can speed up inference because, after the first iteration, the keys and values of the earlier tokens will not change, so we can store and reuse them.
However, every iteration appends a new token to the input sequence, and in my opinion that means the position encoding should change, and thus the KV values of the old part change too, which would make the KV cache useless.
So which part of my description is wrong, and how exactly does the KV cache work?


Hello there,

Did you find an answer? I have the same question.

Hey!

Yes, the KV cache works as you described: it stores the keys and values of the previous tokens. But I don't see why the position embeddings of the already-cached tokens would need to change for it to work correctly.

Here’s how it works in transformers:

When we generate autoregressively, suppose the cache already holds 5 tokens with positions [0, 1, 2, 3, 4]. Every time we generate the next token, it gets its own position id that simply continues from the previous ones: the new token gets position [5], the next one [6], and so on. The positions of the cached tokens never change, so their cached keys and values stay valid.
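Here's a minimal sketch of that decoding loop using GPT-2 from the transformers library (the model choice and prompt are just for illustration; any causal LM with `use_cache=True` behaves the same way). Only the new token is passed to the model on each step, and its position id is inferred from the length of the cache, so the stored keys/values are never recomputed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The KV cache stores", return_tensors="pt").input_ids

with torch.no_grad():
    # First forward pass: keys/values for positions [0 .. n-1] are computed
    # once and returned in past_key_values.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values

    for _ in range(5):
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        # Only the new token is fed in; its position id simply continues the
        # sequence (n, n+1, ...), so the cached keys/values never change.
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```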


Wow, you are right, I don't know how I missed that.

Thank you so much!!