I have an experiment in which I need to run inference 100 times, and this works fine. Later I need to run a larger experiment with 10,000 inferences. That works fine until around 1,500 inferences, at which point the model stops generating: it freezes without raising any error or crashing.
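For context, here is roughly what my loop looks like (a minimal sketch; the model name, prompts, and generation settings are placeholders rather than my exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompts = [f"Example prompt {i}" for i in range(10_000)]  # placeholder inputs

for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Static KV cache enabled for faster, compile-friendly generation.
    output = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
    # Around iteration ~1500 this call never returns: no error, no crash.
```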
I am led to believe this has something to do with the static KV cache.
So how can I keep the faster inference that the KV cache provides whilst periodically emptying it during longer runs?
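Something like the following is what I have in mind, assuming a `StaticCache` can be constructed once and cleared with `reset()` between generations (the exact constructor arguments and the 2048 cache length are assumptions on my part, and may differ between transformers versions):

```python
import torch
from transformers import StaticCache

# Continuing from the sketch above (model, tokenizer, prompts already defined).
cache = StaticCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=2048,       # assumption: long enough for prompt + new tokens
    device="cuda",
    dtype=torch.float16,
)

for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Reuse the pre-allocated static cache for every generation call.
    output = model.generate(**inputs, max_new_tokens=64, past_key_values=cache)
    cache.reset()  # clear the cached keys/values between inferences
```

Is this the intended pattern, or is there a better way to avoid the freeze?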
As a follow-up, I feel this behaviour should be better documented, and that when this freeze occurs transformers should fail gracefully with an error that tells the user what action to take.