I am doing a research project where I use a system prompt with some few-shot ICL to evaluate LLM performance on various tasks. I am comparing CausalLMs (including Llama, Qwen2, BLOOM, etc.) for their performance on this task, but my dataset is very large and the evaluation script takes quite a while to run.
Is it possible to somehow fix some of the KV cache for this instruction + ICL part of the prompt so that it is reused across multiple inferences of the model?
Hey! Yes, we recently added the ability to copy cache objects, so now you can simply copy the same cache and re-use it in different generations. Just make sure you are not using the same cache object twice, as we modify it in place when generating:
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# run the shared instruction + ICL prefix once and keep its KV cache
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values  # this is the common prompt cached

# for each new query, pass a *copy* of the cached prefix to generate()
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
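If you want to re-use the same prefix across your whole dataset, a rough sketch would look like this (suffix_prompts here is a hypothetical list of your per-example continuations):

responses = []
for suffix in suffix_prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + suffix, return_tensors="pt").to("cuda")
    past_key_values = copy.deepcopy(prompt_cache)  # fresh copy each time, generate() mutates it in place
    outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
    responses.append(tokenizer.batch_decode(outputs)[0])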
Thanks a lot for the response (it is also interesting to see Anthropic release their prompt caching beta functionality in their API right around the time I had this question XD).
Continuing to my follow-up question: I was wondering if there is a way to ensure the tokenizer's consistency in tokenizing the prefix behind prompt_cache?
For example, I wonder if the text immediately after the prefix could change the tokenizer behaviour for how the final few tokens in the prefix are tokenized? I may have a fundamental misunderstanding of how a tokenizer's behaviour works, but is there a way to ensure the tokenizer always tokenizes this prefix in the same way?
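To make it concrete, this is roughly the check I have in mind (suffix here is a stand-in for one of my downstream prompts appended after the cached prefix):

prefix_ids = tokenizer(INITIAL_PROMPT).input_ids
full_ids = tokenizer(INITIAL_PROMPT + suffix).input_ids

# count how many leading tokens survive unchanged once the suffix is appended
n_stable = 0
for a, b in zip(prefix_ids, full_ids):
    if a != b:
        break
    n_stable += 1

if n_stable < len(prefix_ids):
    print(f"only {n_stable}/{len(prefix_ids)} prefix tokens are stable; "
          f"the boundary token(s) merged differently with the suffix")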
Sorry, the provided script currently doesn't work, as the PR that was supposed to merge the feature was closed. We'll work on enabling it soon.
Good point on the tokenization issue, that indeed might happen in some cases, so we'll need to account for that. Probably we'll have to do something similar to token healing, cc @joaogante
I don't have to worry about the edge case I mentioned since my cached prompt ends with a role tag, so I do not have any experiments ready to test whether token healing works.
I have two more follow-up questions:
Upon further reading, I am not exactly sure how token healing helps with the problem I mentioned. Correct me if I'm wrong, but doesn't the tokenization need to be deterministic for reusing the KV cache (and token healing does not seem to enforce tokenization in that way)?
I considered adding an explicit check for the number of tokens that match between the cached prompt and the prefix of the new prompt, and only passing in the cache for that portion (a rough sketch of what I mean is below, after my second question), but this seems to add unnecessary CPU cycles, and I wonder if there is a way to nudge tokenizers to generate the first few tokens in a predefined manner.
What if I do inference in a batched scenario with left padding? Since the absolute positions of the tokens change when pad tokens are added on the left, I assume this also affects the positional embeddings and would not allow us to use the non-padded cached prompt's past KV values. Is there a way around this?
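For reference, here is a rough sketch of the check I mean in my first question (assuming prompt_cache was built from the token ids prefix_ids, and that the installed transformers version exposes DynamicCache.crop()):

import copy

def reusable_cache(prompt_cache, prefix_ids, new_ids):
    """Return a copy of the cache cropped to the longest shared token prefix, or None."""
    n_shared = 0
    for a, b in zip(prefix_ids, new_ids):
        if a != b:
            break
        n_shared += 1
    if n_shared == 0:
        return None
    cache = copy.deepcopy(prompt_cache)
    cache.crop(n_shared)  # keep only the KV entries for the tokens that still match
    return cache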
Yeah, I did the same thing as that workaround, but that wasn't what I was trying to ask in my follow-up. Thanks for linking this PR/issue thread though, it's pretty informative!
Hi @jopan, not sure if you're still facing this problem, but I just got back from a break and ran the prompt_reuse recipe you've linked. It did not throw any errors for me. Do you mind elaborating on what you did and the error you received?
As for my approach, I used the workaround mentioned in a previous comment in this thread. However, this approach using DynamicCache (in the recipe you linked) seems like the preferred way now.
However, I think this recipe has a bug: it does not discard the last KV cache entry before reusing the cache for an extension of the prompt, as mentioned in this comment. I did some testing, and the generations don't match the non-cached version (whereas using the method in the comment leads to a match). I have drafted a simple fix for this in this PR if you want a reference.
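In case it helps anyone landing here, a minimal sketch of that fix as I understand it (re-using the variable names from the earlier snippet, and assuming your transformers version exposes DynamicCache.crop()):

import copy
from transformers import DynamicCache

# build the cache for the common prefix, then drop the last cached position so that
# generate() re-processes that boundary token together with the continuation tokens
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs_initial_prompt, past_key_values=DynamicCache()).past_key_values
prompt_cache.crop(inputs_initial_prompt.input_ids.shape[1] - 1)

new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)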
messages = [
    {"role": "system", "content": "you are a helpful coder"},
    {"role": "user", "content": prompt},
]
I have difficulty splitting these structured messages into the fixed (cached) prompt and the extra prompt; the template tokenizes like:
<|start_header_id|>
system
<|end_header_id|>
...
<|start_header_id|>
user
<|end_header_id|>
I don't know how to turn them into a string and pass it to tokenizer(). @RaushanTurganbay
@allenwang37 I don't know exactly how the chat template works for your model, but the main idea is that the initial prompt should be the common prefix text for all sequences you have in the dataset. In case it is a simple template where the system prompt format doesn't change depending on different factors, the below should be your initial prompt:
messages = [
    {"role": "system", "content": "Extract variable tokens in the log"},
    {"role": "user", "content": "ready=true"},
]
This is my chat template. How do I transform the chat template into a sequence (a string) and pass it to tokenizer()? I think it is critical to keep the structure of the messages.
inputs_initial_prompt = tokenizer(prefix, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values
To get the formatted prompt using chat templates, you can call formatted_string = tokenizer.apply_chat_template(conversation, tokenize=False).
Just make sure that concatenating your formatted initial prompt and the continuation is identical to formatting the whole conversation with apply_chat_template. In case it is not identical (might be due to model-specific formatting rules), make sure to manually post-process the output string before calling the tokenizer
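For example, a quick sanity check could look like this (prefix_messages and full_messages below are hypothetical stand-ins for your cached system/few-shot turns and the full conversation):

prefix_messages = [
    {"role": "system", "content": "Extract variable tokens in the log"},
]
full_messages = prefix_messages + [{"role": "user", "content": "ready=true"}]

prefix_text = tokenizer.apply_chat_template(prefix_messages, tokenize=False)
full_text = tokenizer.apply_chat_template(full_messages, tokenize=False)

# if this fails, the template adds or changes something at the boundary (e.g. a generation
# prompt or extra newlines), so post-process prefix_text before caching it
assert full_text.startswith(prefix_text), "formatted prefix is not a prefix of the full prompt"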
Unfortunately, I never ended up fixing that part. It turned out to be more of a hassle than the performance improvement was worth in my case.
However, if your cached prompt is really long and the benefit from caching it in a batched setting is significant (which is mostly the case for training or running backend experiments on a large corpus), I had the following thought, which may be worth exploring:
If the suffix prompts for your task require a common and manageable range of token counts, you can generate a different KV cache for each of those cases by artificially creating the appropriate left and right paddings. There is a serious time vs. space tradeoff here, so be deliberate about how much time KV caching really saves you and how valuable that is.
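A very rough, untested sketch of what I mean for the left-padded case (assuming tokenizer.pad_token_id is set and that your model accepts explicit position_ids, so the cached positions ignore the padding):

import torch
from transformers import DynamicCache

prefix_ids = tokenizer(INITIAL_PROMPT, return_tensors="pt").input_ids.to("cuda")
caches_by_pad_len = {}
for pad_len in (0, 4, 8):  # whatever pad lengths your batches actually produce
    pad = torch.full((1, pad_len), tokenizer.pad_token_id, dtype=torch.long, device="cuda")
    input_ids = torch.cat([pad, prefix_ids], dim=-1)
    attention_mask = torch.cat([torch.zeros_like(pad), torch.ones_like(prefix_ids)], dim=-1)
    # positions must skip the left padding, otherwise the cached rotary positions are wrong
    position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                position_ids=position_ids, past_key_values=DynamicCache())
    # note: this cache has batch size 1; its tensors still need to be repeated along the
    # batch dimension before re-use in a batched generate() call
    caches_by_pad_len[pad_len] = out.past_key_values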