Why is Static Cache latency high?

In the above document, “Static Cache” is marked as having high latency, which I find counterintuitive. My understanding is that a Static Cache, by pre-allocating the cache memory, avoids dynamic memory allocation during inference, which should in theory reduce latency. Am I misunderstanding its implementation, or the definition of “latency” in the document?


This is how I interpreted it: when the Hugging Face docs mark Static Cache as “High” latency, they aren’t disputing that pre-allocating memory avoids dynamic allocations. The rating describes how fast generation runs by default, without any extra steps; on its own, the pre-allocated cache mainly gives you fixed tensor shapes, and the actual latency win comes from combining it with a compiled forward pass.
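
For context, here’s a minimal sketch of the pattern the docs have in mind (assuming a recent transformers release where `generate` accepts `cache_implementation="static"`; the checkpoint, prompt, and generation settings are just illustrative, and `device_map="auto"` assumes `accelerate` is installed): the static cache supplies the fixed shapes, and `torch.compile` is the extra step that turns them into lower latency.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any decoder-only checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The static cache by itself only pre-allocates fixed-shape key/value tensors.
# Compiling the forward pass is what converts those fixed shapes into lower
# latency, since the graph no longer has to deal with a cache that grows
# every decoding step.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Static caches are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Without the `torch.compile` line, generation still works with the static cache, but you shouldn’t expect much of a speedup, which is roughly the default (“High” latency) behaviour the table is describing.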

Hope this helps 🙂

