In the above document, “Static Cache” is marked as having high latency, which I find counterintuitive. My understanding is that a Static Cache pre-allocates memory for the key/value cache, avoiding dynamic memory allocation during inference, which should in theory reduce latency. Am I misunderstanding its implementation, or how “latency” is defined in the document?
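For concreteness, this is the kind of setup I mean (a minimal sketch; the model id and token counts are placeholders, and `cache_implementation="static"` assumes a reasonably recent transformers release and a model architecture that supports the static cache):

```python
# Minimal sketch of what I mean by "Static Cache": transformers' StaticCache,
# enabled via generate(). The model id and lengths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The static cache", return_tensors="pt")

# With "static", the key/value tensors are allocated once at their maximum
# size up front, so their shapes stay fixed across decoding steps instead of
# growing (and being re-allocated) every step as with the dynamic cache.
out = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

My expectation was that the one-time allocation here would amortize away the per-step allocations a dynamic cache performs, hence lower latency.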