Isn't KV cache influenced by position encoding in inference?

The KV cache can speed up inference because after the first iteration, the keys and values of the past tokens do not change, so we can store them.
However, every iteration appends a new token to the input sequence, and in my opinion that means the position encoding should change, and thus the K and V values of the old part change too, which would make the KV cache useless.
So which part of my description is wrong, and how exactly does the KV cache work?


Hello there,

Did you find any answer? I have the same question


Yes, the KV cache works as you described: it stores the previous keys and values. But I don’t see why the position embeddings of the older cached tokens would need to change for it to work correctly.

Here’s how it works in transformers:

When we generate autoregressively, suppose the cache holds 5 tokens with positions [0, 1, 2, 3, 4]. Every time we generate the next token, it gets its own position id, naturally following on from the previous ones. In other words, the new token gets position [5], then [6], and so on. The position ids are absolute and assigned left to right, so appending a new token never shifts the positions of earlier tokens, and their cached keys and values stay valid.
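Here is a toy sketch of that idea in plain Python (not a real transformer: the `embed` and `project` functions are made-up stand-ins for the embedding layer and the K/V/Q weight matrices). The point is that each step embeds only the new token at its absolute position and appends its K/V to the cache; the entries already in the cache are never recomputed.

```python
# Toy illustration of KV caching with absolute position ids.
# Hypothetical 2-d "embeddings" and dot-product attention, for shape only.
import math

D = 2  # toy head dimension

def embed(token_id, pos):
    # Made-up embedding: token value plus a position-dependent term.
    return [token_id + math.sin(pos), token_id + math.cos(pos)]

def project(x, scale):
    # Stand-in for the K/V/Q weight matrices: a simple scaling.
    return [scale * v for v in x]

def attend(q, keys, values):
    # Standard scaled dot-product attention over the cached keys/values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    out = [0.0] * D
    for w, v in zip(weights, values):
        for i in range(D):
            out[i] += w * v[i]
    return out

# Autoregressive loop: only the NEW token is embedded and projected each step;
# the cached K/V entries for earlier positions are reused untouched.
k_cache, v_cache = [], []
tokens = [3, 1, 4, 1, 5]
for pos, tok in enumerate(tokens):   # pos is the absolute position id: 0, 1, 2, ...
    x = embed(tok, pos)
    k_cache.append(project(x, 0.5))  # old cache entries never change
    v_cache.append(project(x, 1.5))
    q = project(x, 2.0)
    out = attend(q, k_cache, v_cache)

print(len(k_cache))  # 5 cached entries, one per position id 0..4
```

Because the position id is baked into each token's embedding once, at the moment that token is processed, appending token 6 with position [5] leaves the cached K/V for positions [0..4] exactly as they were.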


Wow, you are right, I don’t know how I missed that.

Thank you so much!!