Why is Static Cache latency high?

In the above document, “Static Cache” is marked as having high latency, which I find counterintuitive. My understanding is that a Static Cache, by pre-allocating the cache memory, avoids dynamic memory allocation during inference, which should in theory reduce latency. Am I misunderstanding its implementation, or the definition of “latency” in the document?


This is how I interpreted it: when the Hugging Face docs mark Static Cache as “High” latency, they aren’t disputing that pre-allocating memory avoids dynamic allocations. The rating describes how fast generation runs by default, without any extra steps; on its own, the pre-allocated cache mainly gives you fixed tensor shapes, and the actual latency win comes from combining it with a compiled forward pass.
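
For context, here’s a minimal sketch of the pattern the docs have in mind (assuming a recent transformers release where `generate` accepts `cache_implementation="static"`; the checkpoint, prompt, and generation settings are just illustrative, and `device_map="auto"` assumes `accelerate` is installed): the static cache supplies the fixed shapes, and `torch.compile` is the extra step that turns them into lower latency.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any decoder-only checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The static cache by itself only pre-allocates fixed-shape key/value tensors.
# Compiling the forward pass is what converts those fixed shapes into lower
# latency, since the graph no longer has to deal with a cache that grows
# every decoding step.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Static caches are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Without the `torch.compile` line, generation still works with the static cache, but you shouldn’t expect much of a speedup, which is roughly the default (“High” latency) behaviour the table is describing.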

Hope this helps 🙂

