Memory overhead/usage calculation

Hello,

Is there a way to find out how much GPU memory a single inference run uses?
For example, given that I have a text of 256 tokens, when I feed it to the LLM and ask it to generate a response of up to 256 tokens, how can I know how much GPU memory was used for that task?


You can measure it with PyTorch's CUDA memory statistics:

```python
import torch

torch.cuda.reset_peak_memory_stats()
start_mem = torch.cuda.memory_allocated()  # memory already allocated (model weights, inputs)

# Run inference
output = model.generate(input_ids, max_new_tokens=256)

# After inference
end_mem = torch.cuda.memory_allocated()       # memory still allocated after generation
peak_mem = torch.cuda.max_memory_allocated()  # highest allocation reached during generation
```
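
For an end-to-end check, here is a minimal sketch assuming a causal LM loaded with transformers; the `gpt2` checkpoint and the short prompt are only placeholders for your own model and 256-token input. The difference between the peak and the baseline approximates the extra memory that the generation itself needed (KV cache, activations, generated tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint and prompt -- substitute your own model and 256-token input.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
model.eval()

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
baseline = torch.cuda.memory_allocated()  # weights + inputs already resident on the GPU

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

peak = torch.cuda.max_memory_allocated()  # highest allocation reached during generate()

print(f"Baseline (weights + inputs): {baseline / 1024**2:.1f} MiB")
print(f"Peak during generation:      {peak / 1024**2:.1f} MiB")
print(f"Generation overhead:         {(peak - baseline) / 1024**2:.1f} MiB")
```

Note that these counters only track memory allocated through PyTorch's caching allocator; the CUDA context and allocator fragmentation are not included, so nvidia-smi will usually report a higher total.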
