I am trying to use a pre-trained CausalLM model to run inference on millions of samples. For each sample, I prepend a set of context prompts that explain the task before the sample content. Since all samples belong to a single task, the context prompts are always the same.
By the definition of a CausalLM (earlier tokens cannot attend to later tokens), the context prompts should always produce the same KV, regardless of the sample content that follows them. Is it possible to keep the context prompt KV cache in (GPU) memory after running inference on one sample (or batch of samples) and reuse it for the next one, so that we don’t need to recompute this part over and over again?
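
To make the idea concrete, here is a minimal sketch of what I mean, written as a toy causal attention stack in pure Python rather than with a real model. Everything in it (`make_model`, `forward`, the hidden size `D`, the token ids) is a hypothetical name or value made up for illustration, not any library's API. It computes the prefix KV once and then checks that continuing from that cache gives the same result as recomputing prefix + sample from scratch.

```python
import math
import random

D = 4  # toy hidden size (assumption, chosen only to keep the example small)


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    # m is stored as a list of rows; returns the matrix-vector product m @ v
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def make_model(seed=0, n_layers=2, vocab=10):
    # Hypothetical toy model: an embedding table plus q/k/v projections per layer.
    rng = random.Random(seed)
    return {
        "emb": rand_matrix(rng, vocab, D),
        "layers": [
            {name: rand_matrix(rng, D, D) for name in ("q", "k", "v")}
            for _ in range(n_layers)
        ],
    }


def forward(model, tokens, cache=None):
    """Run causal self-attention over `tokens`, continuing from `cache`.

    `cache` holds one (keys, values) pair per layer for tokens already seen.
    Returns the hidden states of `tokens` and a new, extended cache.
    The cache passed in is not mutated, so it can be reused across samples.
    """
    if cache is None:
        cache = [([], []) for _ in model["layers"]]
    hidden = [model["emb"][t] for t in tokens]
    new_cache = []
    for layer, (past_k, past_v) in zip(model["layers"], cache):
        keys, values = list(past_k), list(past_v)  # copy, keep the prefix cache intact
        out = []
        for h in hidden:
            q = matvec(layer["q"], h)
            keys.append(matvec(layer["k"], h))
            values.append(matvec(layer["v"], h))
            # Causal: this token only sees keys/values up to and including itself.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            attn = [
                sum(w * v[i] for w, v in zip(weights, values)) / total
                for i in range(D)
            ]
            out.append([a + b for a, b in zip(h, attn)])  # residual connection
        hidden = out
        new_cache.append((keys, values))
    return hidden, new_cache


def max_abs_diff(rows_a, rows_b):
    return max(abs(a - b) for ra, rb in zip(rows_a, rows_b) for a, b in zip(ra, rb))


model = make_model()
prefix = [1, 2, 3]  # toy token ids standing in for the shared context prompt
_, prefix_cache = forward(model, prefix)  # computed once, reused below

diffs = []
for sample in ([4, 5], [6, 7, 8]):
    cached_out, _ = forward(model, sample, prefix_cache)  # only sample tokens processed
    full_out, _ = forward(model, prefix + sample)         # recompute everything
    diffs.append(max_abs_diff(cached_out, full_out[len(prefix):]))

print(diffs)
```

In Hugging Face `transformers`, the forward pass of causal LM models accepts a `past_key_values` argument and returns one when `use_cache=True`, so I assume the same pattern could work there, probably with the prefix cache copied and expanded to the batch size for each batch. I have not verified how that interacts with padding or batching.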