In the above document, “Static Cache” is marked as having high latency, which I find counterintuitive. My understanding is that a Static Cache pre-allocates memory for the key/value cache, avoiding dynamic memory allocation during inference, which should in theory reduce latency. Am I misunderstanding its implementation, or how “latency” is defined in the document?
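For concreteness, this is the kind of setup I mean (a minimal sketch; the model id and token counts are placeholders, and `cache_implementation="static"` assumes a reasonably recent transformers release and a model architecture that supports the static cache):

```python
# Minimal sketch of what I mean by "Static Cache": transformers' StaticCache,
# enabled via generate(). The model id and lengths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The static cache", return_tensors="pt")

# With "static", the key/value tensors are allocated once at their maximum
# size up front, so their shapes stay fixed across decoding steps instead of
# growing (and being re-allocated) every step as with the dynamic cache.
out = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

My expectation was that the one-time allocation here would amortize away the per-step allocations a dynamic cache performs, hence lower latency.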