I have a very long system prompt and a large dataset of short queries that get appended to it. I would like to use a caching mechanism so that I don't process the long prompt again for each query.
I have read how to cache a common instruction prompt and re-use the cache to continue generation, but I would like to use this caching mechanism together with pipelines.
More precisely, I have a text-generation pipeline to which I would like to provide a system prompt that is reused for each generation.
Is there such a pipeline available?
If not, how could I make my own text-generation pipeline that mimics the base class but uses the caching mechanism?
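To make it concrete, this is roughly what I can already do by hand outside of a pipeline, and what I'd like the pipeline to do for me. It's a minimal, untested sketch: it assumes a recent `transformers` version where `DynamicCache` exists and `generate()` accepts a pre-filled `past_key_values`, and `generate_with_cached_prompt` plus the example queries are just placeholders.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

system_prompt = "<my very long system prompt>"
queries = ["short query 1", "short query 2"]  # stand-ins for my dataset

# Pre-fill a KV cache with the long system prompt once.
prompt_inputs = tokenizer(system_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    prompt_cache = model(**prompt_inputs, past_key_values=DynamicCache()).past_key_values

def generate_with_cached_prompt(query, **gen_kwargs):
    # Tokenize the full text so positions line up, but hand generate() a copy of
    # the cached prompt so only the query tokens are actually re-computed.
    full_inputs = tokenizer(system_prompt + query, return_tensors="pt").to(model.device)
    out = model.generate(
        **full_inputs,
        past_key_values=copy.deepcopy(prompt_cache),
        **gen_kwargs,
    )
    new_tokens = out[0, full_inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

for query in queries:
    print(generate_with_cached_prompt(query, max_new_tokens=64))
```

Ideally the same logic would live inside a `TextGenerationPipeline` subclass (presumably by overriding its `preprocess`/`_forward` steps to inject a copy of the cached prompt), so I keep the pipeline's batching and post-processing.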
Hmm… Perhaps not yet available…?
### Feature request
Hi there,
I'd like to be able to re-use the hidden states for a common (potentially long) prompt across multiple calls to `model.generate()` in order to reduce redundant computation. Here is how I envision a final API, though I'm sure there are multiple ways to do it.
```python
# Load model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b')
tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')

# Common prompt that we'd prepend to every example
prompt = "This is a common prompt in every example."
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids

# Examples to pass to generate
examples = ["Ackbar went to", "Billaba enjoys", "Cody is eating some"]

# Generation loop
outputs = []
prompt_hidden_state = None
for ex in examples:
    # Current way of doing things: re-encode the full prompt every time
    out = model.generate(
        **tokenizer(prompt + ex, return_tensors='pt'),
    )

    # Proposed method: re-use prompt_hidden_state across calls
    out = model.generate(
        **tokenizer(ex, return_tensors='pt'),
        common_prompt_ids=prompt_ids,
        prompt_hidden_state=prompt_hidden_state,
    )
    prompt_hidden_state = out.prompt_hidden_state
    outputs.append(out.sequences)
```
Thanks in advance.
### Motivation
A very common pattern for LLM usage is having a common prompt (e.g., instructions and input/output pairs), a sample input, and asking the model to generate the sample output. For example:
```
You are a programmer's assistant which converts English descriptions to Python functions.
English: <example 1 description>
Python: <example 1 function>
English: <example 2 description>
Python: <example 2 function>
English: <example 3 description>
Python: <example 3 function>
English: <input description>
Python:
```
I'd like to be able to cache the common part of the prompt (everything before `<input description>`, which appears in every example) across inputs, to avoid potentially expensive re-computation.
### Your contribution
The only existing info I could find is the short discussion [here](https://discuss.huggingface.co/t/avoid-recalculating-hidden-states-between-generate-calls/34209). I tried messing around a bit to get this to work but had little luck. I'm not familiar with the inner workings of `transformers` and ran into numerous errors. One problem is padding: with left padding, the cached prompt hidden states can end up misaligned across a batch, e.g.:
```
<p> <p> <p> common prompt x_1 x_2 x_3
<p> <p> common prompt x_1 x_2 x_3 x_4
<p> <p> <p> <p> common prompt x_1 x_2
```
I don't know the best way to solve this. Do we dynamically pad every tensor in `past_key_values`? That seems slow but I don't know if it actually is.
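For concreteness, this is roughly the tensor bookkeeping I was experimenting with (untested; it assumes the legacy cache format, i.e. a tuple of per-layer `(key, value)` pairs of shape `(batch, num_heads, seq_len, head_dim)`, and `left_pad_prompt_cache` is just a name I made up):

```python
import torch

def left_pad_prompt_cache(prompt_past_key_values, pad_len):
    """Left-pad a batch-size-1 prompt KV cache along the sequence dimension.

    Returns the padded cache plus the attention-mask prefix (0 for the fake
    pad positions, 1 for the real prompt tokens) that has to be prepended to
    the mask of the per-example tokens.
    """
    padded_layers = []
    for key, value in prompt_past_key_values:
        pad = key.new_zeros(key.shape[0], key.shape[1], pad_len, key.shape[3])
        padded_layers.append(
            (torch.cat([pad, key], dim=2), torch.cat([pad, value], dim=2))
        )
    prompt_len = prompt_past_key_values[0][0].shape[2]
    mask_prefix = torch.cat(
        [torch.zeros(1, pad_len, dtype=torch.long),
         torch.ones(1, prompt_len, dtype=torch.long)],
        dim=1,
    )
    return tuple(padded_layers), mask_prefix
```

This only covers padding the cache tensors themselves; lining up the per-example suffixes (and whatever `generate` does internally with `position_ids`) is the part I couldn't get right.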
If someone can suggest a better/easier way, or give some more pointers on how to solve the padding issue, I'd be happy to try again myself.
Thanks in advance.