I am doing a research project where I use a system prompt with some few-shot ICL to evaluate LLM performance on various tasks. I am comparing CausalLMs (including Llama, Qwen2, BLOOM, etc.) for their performance on this task, but my dataset is very large and the evaluation script takes quite a while to run.
Is it possible to somehow fix some of the KV cache for this instruction + ICL part of the prompt so that it is reused across multiple inferences of the model?
Hey! Yes, we recently added the ability to copy cache objects, so now you can simply copy the same cache and re-use it in different generations. Just make sure you are not using the same cache object twice, as we modify it in place when generating:
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# run the shared instruction + ICL prefix once and keep its KV cache
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values  # this is the common prompt cached

# for each new query, pass a *copy* of the cached prefix to generate()
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
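If you want to re-use the same prefix across your whole dataset, a rough sketch would look like this (suffix_prompts here is a hypothetical list of your per-example continuations):

responses = []
for suffix in suffix_prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + suffix, return_tensors="pt").to("cuda")
    past_key_values = copy.deepcopy(prompt_cache)  # fresh copy each time, generate() mutates it in place
    outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
    responses.append(tokenizer.batch_decode(outputs)[0])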
Thanks a lot for the response (it is also interesting to see Anthropic release their prompt caching beta functionality in their API right around the time I had this question XD).
Continuing to my follow-up question: I was wondering if there is a way to ensure the tokenizer's consistency in tokenizing the prefix behind prompt_cache?
For example, I wonder if the text immediately after the prefix could change the tokenizer behaviour for how the final few tokens in the prefix are tokenized? I may have a fundamental misunderstanding of how a tokenizer's behaviour works, but is there a way to ensure the tokenizer always tokenizes this prefix in the same way?
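To make it concrete, this is roughly the check I have in mind (suffix here is a stand-in for one of my downstream prompts appended after the cached prefix):

prefix_ids = tokenizer(INITIAL_PROMPT).input_ids
full_ids = tokenizer(INITIAL_PROMPT + suffix).input_ids

# count how many leading tokens survive unchanged once the suffix is appended
n_stable = 0
for a, b in zip(prefix_ids, full_ids):
    if a != b:
        break
    n_stable += 1

if n_stable < len(prefix_ids):
    print(f"only {n_stable}/{len(prefix_ids)} prefix tokens are stable; "
          f"the boundary token(s) merged differently with the suffix")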
Sorry, the provided script currently doesn't work, as the PR that was supposed to merge the feature was closed. We'll work on enabling it soon.
Good point on the tokenization issue, that indeed might happen in some cases, so we'll need to account for that. Probably we'll have to do something similar to token healing, cc @joaogante
I don't have to worry about the edge case I mentioned since my cached prompt ends with a role tag, so I do not have any experiments ready to test whether token healing works.
I have two more follow-up questions:
Upon further reading, I am not exactly sure how token healing helps with the problem I mentioned. Correct me if I'm wrong, but doesn't the tokenization need to be deterministic for reusing the KV cache (and token healing does not seem to enforce tokenization in that way)?
I considered adding an explicit check for the number of tokens that match between the cached prompt and the prefix of the new prompt, and only passing in the cache for that portion (a rough sketch of what I mean is below, after my second question), but this seems to add unnecessary CPU cycles, and I wonder if there is a way to nudge tokenizers to generate the first few tokens in a predefined manner.
What if I do inference in a batched scenario with left padding? Since the absolute positions of the tokens change when pad tokens are added on the left, I assume this also affects the positional embeddings and would not allow us to use the non-padded cached prompt's past KV values. Is there a way around this?
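For reference, here is a rough sketch of the check I mean in my first question (assuming prompt_cache was built from the token ids prefix_ids, and that the installed transformers version exposes DynamicCache.crop()):

import copy

def reusable_cache(prompt_cache, prefix_ids, new_ids):
    """Return a copy of the cache cropped to the longest shared token prefix, or None."""
    n_shared = 0
    for a, b in zip(prefix_ids, new_ids):
        if a != b:
            break
        n_shared += 1
    if n_shared == 0:
        return None
    cache = copy.deepcopy(prompt_cache)
    cache.crop(n_shared)  # keep only the KV entries for the tokens that still match
    return cache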
Yeah, I did the same thing as that workaround, but that wasn't what I was trying to ask in my follow-up. Thanks for linking this PR/issue thread though, it's pretty informative!
Hi @jopan, not sure if you're still facing this problem, but I just got back from a break and ran the prompt_reuse recipe you've linked. It did not throw any errors for me. Do you mind elaborating on what you did and the error you received?
As for my approach, I used the workaround mentioned in a previous comment in this thread. However, this approach using DynamicCache (in the recipe you linked) seems like the preferred way now.
However, I think this recipe has a bug: it does not discard the last KV cache entry before reusing the cache for an extension of the prompt, as mentioned in this comment. I did some testing, and the generations don't match the non-cached version (whereas using the method in the comment leads to a match). I have drafted a simple fix for this in this PR if you want a reference.
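In case it helps anyone landing here, a minimal sketch of that fix as I understand it (re-using the variable names from the earlier snippet, and assuming your transformers version exposes DynamicCache.crop()):

import copy
from transformers import DynamicCache

# build the cache for the common prefix, then drop the last cached position so that
# generate() re-processes that boundary token together with the continuation tokens
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs_initial_prompt, past_key_values=DynamicCache()).past_key_values
prompt_cache.crop(inputs_initial_prompt.input_ids.shape[1] - 1)

new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)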
messages = [
    {"role": "system", "content": "you are a helpful coder"},
    {"role": "user", "content": prompt},
]
I have difficulty splitting these structured messages into the fixed (cached) prompt and the extra prompt; the template tokenizes like:
<|start_header_id|>
system
<|end_header_id|>
...
<|start_header_id|>
user
<|end_header_id|>
I don't know how to turn them into a string and pass it to tokenizer(). @RaushanTurganbay
@allenwang37 I don't know exactly how the chat template works for your model, but the main idea is that the initial prompt should be the common prefix text for all sequences you have in the dataset. In case it is a simple template where the system prompt format doesn't change depending on different factors, the below should be your initial prompt:
messages = [
    {"role": "system", "content": "Extract variable tokens in the log"},
    {"role": "user", "content": "ready=true"},
]
This is my chat template. How do I transform the chat template into a sequence (a string) and pass it to tokenizer()? I think it is critical to keep the structure of the messages.
inputs_initial_prompt = tokenizer(prefix, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values
To get the formatted prompt using chat templates, you can call formatted_string = tokenizer.apply_chat_template(conversation, tokenize=False).
Just make sure that concatenating your formatted initial prompt and the continuation is identical to formatting the whole conversation with apply_chat_template. In case it is not identical (might be due to model-specific formatting rules), make sure to manually post-process the output string before calling the tokenizer
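For example, a quick sanity check could look like this (prefix_messages and full_messages below are hypothetical stand-ins for your cached system/few-shot turns and the full conversation):

prefix_messages = [
    {"role": "system", "content": "Extract variable tokens in the log"},
]
full_messages = prefix_messages + [{"role": "user", "content": "ready=true"}]

prefix_text = tokenizer.apply_chat_template(prefix_messages, tokenize=False)
full_text = tokenizer.apply_chat_template(full_messages, tokenize=False)

# if this fails, the template adds or changes something at the boundary (e.g. a generation
# prompt or extra newlines), so post-process prefix_text before caching it
assert full_text.startswith(prefix_text), "formatted prefix is not a prefix of the full prompt"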
Unfortunately, I never ended up fixing that part. It turned out to be more of a hassle than the performance improvement was worth in my case.
However, if your cached prompt is really long and the benefit from caching it in a batched setting is significant (which is mostly the case for training or running backend experiments on a large corpus), I had the following thought, which may be worth exploring:
If the suffix prompts for your task require a common and manageable range of token counts, you can generate a different KV cache for each of those cases by artificially creating the appropriate left and right paddings. There is a serious time vs. space tradeoff here, so be deliberate about how much time KV caching really saves you and how valuable that is.
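A very rough, untested sketch of what I mean for the left-padded case (assuming tokenizer.pad_token_id is set and that your model accepts explicit position_ids, so the cached positions ignore the padding):

import torch
from transformers import DynamicCache

prefix_ids = tokenizer(INITIAL_PROMPT, return_tensors="pt").input_ids.to("cuda")
caches_by_pad_len = {}
for pad_len in (0, 4, 8):  # whatever pad lengths your batches actually produce
    pad = torch.full((1, pad_len), tokenizer.pad_token_id, dtype=torch.long, device="cuda")
    input_ids = torch.cat([pad, prefix_ids], dim=-1)
    attention_mask = torch.cat([torch.zeros_like(pad), torch.ones_like(prefix_ids)], dim=-1)
    # positions must skip the left padding, otherwise the cached rotary positions are wrong
    position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                position_ids=position_ids, past_key_values=DynamicCache())
    # note: this cache has batch size 1; its tensors still need to be repeated along the
    # batch dimension before re-use in a batched generate() call
    caches_by_pad_len[pad_len] = out.past_key_values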