I have an experiment in which I need to run inference 100 times, and this works fine. Later I need to run a larger experiment with 10,000 inferences. That works fine until around 1,500 inferences, at which point the model stops generating: it freezes without raising any error or crashing.
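For context, here is roughly what my loop looks like (a minimal sketch; the model name, prompts, and generation settings are placeholders rather than my exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompts = [f"Example prompt {i}" for i in range(10_000)]  # placeholder inputs

for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Static KV cache enabled for faster, compile-friendly generation.
    output = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
    # Around iteration ~1500 this call never returns: no error, no crash.
```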
I am led to believe this has something to do with the static KV cache.
So how can I keep the faster inference that the KV cache provides whilst periodically emptying it during longer runs?
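Something like the following is what I have in mind, assuming a `StaticCache` can be constructed once and cleared with `reset()` between generations (the exact constructor arguments and the 2048 cache length are assumptions on my part, and may differ between transformers versions):

```python
import torch
from transformers import StaticCache

# Continuing from the sketch above (model, tokenizer, prompts already defined).
cache = StaticCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=2048,       # assumption: long enough for prompt + new tokens
    device="cuda",
    dtype=torch.float16,
)

for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Reuse the pre-allocated static cache for every generation call.
    output = model.generate(**inputs, max_new_tokens=64, past_key_values=cache)
    cache.reset()  # clear the cached keys/values between inferences
```

Is this the intended pattern, or is there a better way to avoid the freeze?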
As a follow-up, I feel this behaviour should be better documented, and that when this freeze occurs transformers should fail gracefully with an error that tells the user what action to take.