@joaogante Thanks a lot for the reply!
I don’t have to worry about the edge case I mentioned, since my cached prompt ends with a role tag, so I don’t have an experiment ready to check whether token healing works there.
I have two more follow-up questions:
- Upon further reading, I am not exactly sure how token healing helps with the problem I mentioned. Correct me if I’m wrong, but doesn’t the tokenization of the shared prefix need to be deterministic for the KV cache to be reusable (and token healing does not seem to enforce tokenization in that way)? I considered adding an explicit check for how many tokens match between the cached prompt and the prefix of the new prompt, and passing in the cache only for that portion (first sketch below), but this adds extra CPU cycles, and I wonder if there is a way to nudge the tokenizer to produce the first few tokens in a predefined manner.
- What if I do inference in a batched scenario with left padding? Since the absolute positions of the tokens shift when pad tokens are added on the left, I assume this also changes the positional embeddings and would not allow us to reuse the past KV values of the non-padded cached prompt (second sketch below). Is there a way around this?
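To make the first question concrete, here is roughly the explicit check I had in mind, sketched against a recent transformers version with `DynamicCache`. The helper `find_common_prefix_len` and the use of `DynamicCache.crop` to drop the non-matching tail are just my assumptions about how this could be wired up, not something I have validated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

def find_common_prefix_len(cached_ids: torch.Tensor, new_ids: torch.Tensor) -> int:
    """Number of leading token ids shared by the cached prompt and the new prompt."""
    min_len = min(cached_ids.shape[-1], new_ids.shape[-1])
    mismatch = (cached_ids[0, :min_len] != new_ids[0, :min_len]).nonzero()
    return int(mismatch[0]) if mismatch.numel() > 0 else min_len

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Pre-fill the cache with the shared prompt (in my case it ends with a role tag).
cached_ids = tokenizer("You are a helpful assistant.\nUser:", return_tensors="pt").input_ids
cache = DynamicCache()
with torch.no_grad():
    model(cached_ids, past_key_values=cache, use_cache=True)

# New request that shares the cached prefix.
new_ids = tokenizer("You are a helpful assistant.\nUser: Hello there!", return_tensors="pt").input_ids

# Explicit check: keep only the cache entries whose tokens still match the new prompt.
keep = find_common_prefix_len(cached_ids, new_ids)
cache.crop(keep)

# generate() should then only run the forward pass over the tokens that the
# cropped cache does not already cover.
out = model.generate(new_ids, past_key_values=cache, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

This is exactly the part that feels like wasted CPU cycles, hence my question about steering the tokenizer instead.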
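For the second question, a tiny illustration of the position shift I mean. The mask-aware positions at the end are only my guess at what the unpadded cached KV values would need in order to line up, not something I have confirmed against the models’ internals:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

cached_prompt = "You are a helpful assistant.\nUser:"
prompts = [cached_prompt + " Hi", cached_prompt + " What is the weather like today?"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# With naive absolute positions 0..L-1, the real tokens of the shorter
# (left-padded) sequence no longer sit at the indices they had in the
# unpadded cached prompt.
naive_positions = torch.arange(batch.input_ids.shape[1]).expand(batch.input_ids.shape)

# Mask-aware positions that start counting at the first real token -- my guess
# at what would be needed for the non-padded cached prompt's KVs to be reusable.
mask_positions = (batch.attention_mask.cumsum(-1) - 1).clamp(min=0)

print(batch.attention_mask[0])
print(naive_positions[0])
print(mask_positions[0])
```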