Pass CausalLM KV cache into the next inference batch

webber26232 · October 14, 2023, 3:24pm

I am trying to use a pre-trained CausalLM model to run inference on millions of samples. For each sample, I prepend a set of context prompts that explain the task before the sample content. Since the task is the same for every sample, the context prompts never change.

By the definition of a CausalLM (earlier tokens cannot attend to later tokens), the context prompts should always produce the same KV values, regardless of the sample content that follows them. Is it possible to keep the context prompts' KV cache in (GPU) memory after running inference on one sample (or batch of samples) and reuse it for the next sample (or batch), so that we don’t need to recompute this part over and over again?
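
To make the idea concrete, here is a minimal sketch of the property I am relying on, written as a toy pure-Python causal attention stack rather than a real model. Everything in it (`ToyCausalLM`, `forward`, the `cache` argument, the random weights) is hypothetical and only for illustration; it is not the Transformers API. It computes the per-layer keys and values for the shared prefix once, then reuses them for two different samples and compares the result against a full recompute.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class ToyCausalLM:
    """Hypothetical stack of single-head causal attention layers (illustration only)."""

    def __init__(self, dim, n_layers, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One (Wq, Wk, Wv) triple per layer.
        self.layers = [
            tuple(rand_matrix(rng, dim, dim) for _ in range(3))
            for _ in range(n_layers)
        ]

    def forward(self, xs, cache=None):
        """Process new token vectors xs given an optional per-layer (keys, values) cache.

        Returns (hidden states for xs, extended cache). The input cache is copied,
        not mutated, so it can be reused for the next sample.
        """
        if cache is None:
            cache = [([], []) for _ in self.layers]
        new_cache = []
        hidden = xs
        for (Wq, Wk, Wv), (old_k, old_v) in zip(self.layers, cache):
            keys, values = list(old_k), list(old_v)
            outputs = []
            for x in hidden:
                q = matvec(Wq, x)
                keys.append(matvec(Wk, x))
                values.append(matvec(Wv, x))
                # Causal: this token only sees keys appended so far.
                scores = [
                    sum(a * b for a, b in zip(q, k)) / math.sqrt(self.dim)
                    for k in keys
                ]
                top = max(scores)
                weights = [math.exp(s - top) for s in scores]
                total = sum(weights)
                attn = [
                    sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(self.dim)
                ]
                outputs.append([xi + ai for xi, ai in zip(x, attn)])  # residual
            hidden = outputs
            new_cache.append((keys, values))
        return hidden, new_cache


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rand_tokens(rng, n, dim):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]


rng = random.Random(1)
model = ToyCausalLM(dim=4, n_layers=2)
prefix = rand_tokens(rng, 5, 4)    # shared context prompt
sample_a = rand_tokens(rng, 3, 4)  # first sample
sample_b = rand_tokens(rng, 2, 4)  # second sample

# Compute the prefix KV cache once.
_, prefix_cache = model.forward(prefix)

# Reuse it for each sample and compare with recomputing prefix + sample.
full_a, _ = model.forward(prefix + sample_a)
cached_a, _ = model.forward(sample_a, cache=prefix_cache)
full_b, _ = model.forward(prefix + sample_b)
cached_b, _ = model.forward(sample_b, cache=prefix_cache)

diff_a = max_abs_diff(full_a[len(prefix):], cached_a)
diff_b = max_abs_diff(full_b[len(prefix):], cached_b)
print(diff_a, diff_b)
```

Both differences should be zero up to floating-point rounding, which is the behavior I would like to get from a real model. In Transformers, the closest equivalent I know of is the `past_key_values` returned by a forward pass with `use_cache=True`. My assumption is that this cache would need to be copied before each reuse (so one sample's tokens do not leak into the next) and expanded along the batch dimension to match the batch size, but I have not verified either point.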
