Storing and loading KV cache

Hello,

is there an explicit way to store and later load the KV cache in the models?

Thanks!

Hey!

You can reuse a cache object in the next generation steps as follows:

out = model.generate(input_ids, use_cache=True, return_dict_in_generate=True)
past_key_values = out.past_key_values
generated_ids = out.sequences

# Now we can continue generation using cache and already generated tokens
out_continued = model.generate(generated_ids, past_key_values=past_key_values, return_dict_in_generate=True)
continued_generated_ids = out_continued.sequences

If you want to save the cache and load it back, we don’t have an explicit API for that yet. But you can save the keys and values yourself, where each of them is a list of tensors (one per layer):

import torch

keys, values = past_key_values.key_cache, past_key_values.value_cache
torch.save(keys, "keys.pt")
torch.save(values, "values.pt")

# Later you can load it back as follows, assuming you used the default DynamicCache
from transformers import DynamicCache

past_key_values = DynamicCache()
past_key_values.key_cache = torch.load("keys.pt")
past_key_values.value_cache = torch.load("values.pt")

Btw, can you share the use case where saving and loading the cache is needed? We are now trying to make a unified API for all cache objects, and it would help us understand common use cases.


Thanks for the answer @RaushanTurganbay!
I will start with that and try it out.

Regarding use case -
The use case is that we have many long-context examples that we query repeatedly, so it doesn’t make sense to recompute the whole context for every query. It makes more sense to compute the KV values once and reload them for each query.
Does that make sense?
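
Concretely, here is a rough sketch of what I have in mind, based on the snippets above (the model name, file names, and the long_context / query strings are just placeholders, and I’m assuming the default DynamicCache):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder, any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

long_context = "..."  # the long document we query over and over (placeholder)
query = "..."         # one of many questions about that document (placeholder)

# 1) Pre-fill the cache for the long context once and save it to disk
context_ids = tokenizer(long_context, return_tensors="pt").input_ids
with torch.no_grad():
    ctx_cache = model(context_ids, past_key_values=DynamicCache(), use_cache=True).past_key_values
torch.save(ctx_cache.key_cache, "ctx_keys.pt")
torch.save(ctx_cache.value_cache, "ctx_values.pt")

# 2) For every query, load a fresh copy of the context cache (generation
#    extends the cache in place, so the same object shouldn't be reused)
cache = DynamicCache()
cache.key_cache = torch.load("ctx_keys.pt")
cache.value_cache = torch.load("ctx_values.pt")
query_ids = tokenizer(long_context + query, return_tensors="pt").input_ids
out = model.generate(query_ids, past_key_values=cache,
                     max_new_tokens=50, return_dict_in_generate=True)
print(tokenizer.decode(out.sequences[0, query_ids.shape[1]:], skip_special_tokens=True))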

I see, thanks for explaining. We are making a cache that will inherit from torch.nn.Module here, so after the PR is merged you should be able to copy the cache or save it with torch.save.
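
For example, something like this sketch should work once the PR lands (the file name is arbitrary):

import torch

# Once the cache object is an nn.Module, saving/loading should reduce to
# plain torch.save / torch.load of the whole cache
torch.save(past_key_values, "past_key_values.pt")
past_key_values = torch.load("past_key_values.pt")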

That’s great! Any estimate of when this PR will be merged?