Yes, I'm seeing the same problem. I've been experimenting with 13B-parameter LLaMA models in fp16, and when the context window is filled to the maximum, total memory use can be almost double what the model weights alone occupy. Yet everywhere I read that the overhead should be at most 20 percent of the model size.
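I think the gap comes from the KV cache: it grows linearly with both context length and batch size, so at long contexts the overhead can easily blow past the commonly quoted 20%. Here's a rough back-of-the-envelope sketch (my own, not from any docs); the geometry below assumes LLaMA-13B (40 layers, 40 attention heads, head dim 128, no grouped-query attention), and `seq_len`/`batch` are placeholders you'd adjust to your setup:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to cache K and V for every layer at full context."""
    # Factor of 2 for the separate K and V tensors; dtype_bytes=2 for fp16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed LLaMA-13B geometry: 40 layers, 40 KV heads, head_dim 128 (hidden 5120).
weights_gib = 13e9 * 2 / 2**30  # fp16 weights: ~24 GiB
cache_gib = kv_cache_bytes(40, 40, 128, seq_len=4096, batch=1) / 2**30

print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{cache_gib:.1f} GiB "
      f"(~{100 * cache_gib / weights_gib:.0f}% overhead)")
```

With these numbers a single 4096-token sequence adds about 3 GiB, roughly 13% of the fp16 weights, but the cache scales proportionally with batch size and context length, and on top of it come activation buffers and any preallocation your framework does. That could plausibly account for the near-2x you're observing, rather than contradicting the 20% figure outright.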