VRAM keeps increasing during sequential llama2-13b inference

VRAM consumption starts at 26 GB and gradually balloons to around 40 GB during sequential inference. I want to run multiple instances of the service on a single 80 GB GPU (non-quantized), so I'd like to know whether there is a way to disable or limit the caching, or a command to free the cached GPU memory without hurting model latency.
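For context, below is a simplified sketch of the kind of per-request cleanup I'm wondering about, assuming a plain PyTorch/Transformers serving loop (the model path, prompt, and generation settings are placeholders, not my exact setup). Is calling `torch.cuda.empty_cache()` between requests the right way to return the allocator's cached blocks, and does it add meaningful latency?

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model path; my actual service loads a local checkpoint.
model_name = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Drop per-request tensors, then ask the caching allocator to release
    # its unused cached blocks back to the driver. Is this safe to do on
    # every request, or does it defeat the cache and slow things down?
    del inputs, output_ids
    torch.cuda.empty_cache()
    return text

# Numbers I'm watching between requests:
print(torch.cuda.memory_allocated() / 1e9, "GB allocated by tensors")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved by the caching allocator")
```

I've also seen `PYTORCH_CUDA_ALLOC_CONF` (e.g. `max_split_size_mb`) and `torch.cuda.set_per_process_memory_fraction()` mentioned as ways to cap the allocator, but I'm not sure which of these is appropriate for keeping several instances within the 80 GB budget.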