@joaogante Thanks a lot for the reply!
I don’t have to worry about the edge case I mentioned, since my cached prompt ends with a role tag, so I don’t have an experiment ready to check whether token healing works there.
I have two more follow-up questions:
- Upon further reading, I am not exactly sure how token healing helps with the problem I mentioned. Correct me if I’m wrong, but doesn’t the tokenization of the shared prefix need to be deterministic for the KV cache to be reusable (and token healing does not seem to enforce tokenization in that way)? I considered adding an explicit check for how many tokens match between the cached prompt and the prefix of the new prompt, and passing in the cache only for that portion (first sketch below), but this adds extra CPU cycles, and I wonder if there is a way to nudge the tokenizer to produce the first few tokens in a predefined manner.
- What if I do inference in a batched scenario with left padding? Since the absolute positions of the tokens shift when pad tokens are added on the left, I assume this also changes the positional embeddings and would not allow us to reuse the past KV values of the non-padded cached prompt (second sketch below). Is there a way around this?
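To make the first question concrete, here is roughly the explicit check I had in mind, sketched against a recent transformers version with `DynamicCache`. The helper `find_common_prefix_len` and the use of `DynamicCache.crop` to drop the non-matching tail are just my assumptions about how this could be wired up, not something I have validated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

def find_common_prefix_len(cached_ids: torch.Tensor, new_ids: torch.Tensor) -> int:
    """Number of leading token ids shared by the cached prompt and the new prompt."""
    min_len = min(cached_ids.shape[-1], new_ids.shape[-1])
    mismatch = (cached_ids[0, :min_len] != new_ids[0, :min_len]).nonzero()
    return int(mismatch[0]) if mismatch.numel() > 0 else min_len

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Pre-fill the cache with the shared prompt (in my case it ends with a role tag).
cached_ids = tokenizer("You are a helpful assistant.\nUser:", return_tensors="pt").input_ids
cache = DynamicCache()
with torch.no_grad():
    model(cached_ids, past_key_values=cache, use_cache=True)

# New request that shares the cached prefix.
new_ids = tokenizer("You are a helpful assistant.\nUser: Hello there!", return_tensors="pt").input_ids

# Explicit check: keep only the cache entries whose tokens still match the new prompt.
keep = find_common_prefix_len(cached_ids, new_ids)
cache.crop(keep)

# generate() should then only run the forward pass over the tokens that the
# cropped cache does not already cover.
out = model.generate(new_ids, past_key_values=cache, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

This is exactly the part that feels like wasted CPU cycles, hence my question about steering the tokenizer instead.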
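For the second question, a tiny illustration of the position shift I mean. The mask-aware positions at the end are only my guess at what the unpadded cached KV values would need in order to line up, not something I have confirmed against the models’ internals:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

cached_prompt = "You are a helpful assistant.\nUser:"
prompts = [cached_prompt + " Hi", cached_prompt + " What is the weather like today?"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# With naive absolute positions 0..L-1, the real tokens of the shorter
# (left-padded) sequence no longer sit at the indices they had in the
# unpadded cached prompt.
naive_positions = torch.arange(batch.input_ids.shape[1]).expand(batch.input_ids.shape)

# Mask-aware positions that start counting at the first real token -- my guess
# at what would be needed for the non-padded cached prompt's KVs to be reusable.
mask_positions = (batch.attention_mask.cumsum(-1) - 1).clamp(min=0)

print(batch.attention_mask[0])
print(naive_positions[0])
print(mask_positions[0])
```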