Memory overhead/usage calculation

Hello,

Is there a way to find out how much GPU memory a single inference run uses?
For example, given that I have a text of 256 tokens, when I feed it to the LLM and ask it to generate a response of up to 256 tokens, how can I know how much GPU memory was used for that task?


You can measure it with PyTorch's CUDA memory statistics:

```python
import torch

torch.cuda.reset_peak_memory_stats()
start_mem = torch.cuda.memory_allocated()  # memory already allocated (model weights, inputs)

# Run inference
output = model.generate(input_ids, max_new_tokens=256)

# After inference
end_mem = torch.cuda.memory_allocated()       # memory still allocated after generation
peak_mem = torch.cuda.max_memory_allocated()  # highest allocation reached during generation
```
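
For an end-to-end check, here is a minimal sketch assuming a causal LM loaded with transformers; the `gpt2` checkpoint and the short prompt are only placeholders for your own model and 256-token input. The difference between the peak and the baseline approximates the extra memory that the generation itself needed (KV cache, activations, generated tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint and prompt -- substitute your own model and 256-token input.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
model.eval()

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
baseline = torch.cuda.memory_allocated()  # weights + inputs already resident on the GPU

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

peak = torch.cuda.max_memory_allocated()  # highest allocation reached during generate()

print(f"Baseline (weights + inputs): {baseline / 1024**2:.1f} MiB")
print(f"Peak during generation:      {peak / 1024**2:.1f} MiB")
print(f"Generation overhead:         {(peak - baseline) / 1024**2:.1f} MiB")
```

Note that these counters only track memory allocated through PyTorch's caching allocator; the CUDA context and allocator fragmentation are not included, so nvidia-smi will usually report a higher total.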
