VRAM keeps increasing during sequential llama2-13b inference

VRAM consumption starts at 26 GB, then balloons up to 40 GB during sequential inference. I want to run multiple instances of the service on a single 80 GB GPU (non-quantized). Is there any way to disable or limit the caching? Or is there a command to free cached GPU memory without impacting model latency?
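Not an authoritative fix, but a minimal sketch of two levers that commonly apply if the service uses a PyTorch-based stack (the original post doesn't say which framework it uses, so this is an assumption): `torch.cuda.empty_cache()` returns blocks held by PyTorch's caching allocator to the driver, and `torch.cuda.set_per_process_memory_fraction()` hard-caps a process's share of VRAM so several instances can coexist on one GPU. Note that `empty_cache()` cannot free live tensors (e.g. a `past_key_values` KV cache still referenced by your code), so references must be dropped first:

```python
# Hedged sketch, assuming a PyTorch-based inference stack; function names
# here are illustrative and not from the original post.
import gc

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    # torch not installed; the helpers below become no-ops.
    HAVE_CUDA = False


def free_cached_gpu_memory():
    """Drop dangling Python references, then return cached blocks to the driver.

    empty_cache() only releases memory the caching allocator is holding on to,
    not live tensors, so garbage-collect first (e.g. to drop an old KV cache).
    """
    gc.collect()
    if HAVE_CUDA:
        torch.cuda.empty_cache()


def cap_process_memory(fraction=0.25, device=0):
    """Cap this process at a fraction of total VRAM (e.g. 0.25 of 80 GB = 20 GB).

    Allocations beyond the cap raise an out-of-memory error instead of growing
    unbounded, which makes co-locating multiple instances on one GPU safer.
    """
    if HAVE_CUDA:
        torch.cuda.set_per_process_memory_fraction(fraction, device)


if __name__ == "__main__":
    cap_process_memory(0.25)
    free_cached_gpu_memory()
    print("ok" if HAVE_CUDA else "no CUDA available")
```

Calling `empty_cache()` between requests does add some latency, since freed memory has to be re-allocated from the driver on the next request. An alternative worth checking (newer PyTorch versions) is setting the `PYTORCH_CUDA_ALLOC_CONF` environment variable to tune the allocator's behavior rather than flushing it manually.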

We are facing the same problem. Have you found a fix yet? It would be really helpful.