How is memory managed in the model.generate() method?

I am running the model.generate() method for my task. While it runs, I observe that GPU memory starts at some level X, fills up to Y (the capacity of the GPU), then drops back to a level Z (with Z < Y), and refills up to Y again. No OOM error, nothing. What's actually happening under the hood?
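
For reference, here is a minimal sketch for watching those numbers from inside the script rather than via nvidia-smi. It assumes a CUDA GPU; "gpt2" is just a placeholder checkpoint, not the model from the question:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mem_report(tag):
    # allocated = memory held by live tensors; reserved = what PyTorch's
    # caching allocator keeps from the driver (roughly what nvidia-smi shows)
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")

mem_report("after model load")   # roughly your starting level X
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=200)
mem_report("after generate")     # peak during generation is Y; once the
                                 # cache tensors are freed you settle at Z
```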


During generation, memory grows because we store a cache of previously computed keys and values, the KV cache (see Best Practices for Generation with Cache for more). This results in higher memory consumption in return for faster generation.
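
To see the cache's contribution directly, you can compare peak memory with and without it. This is a sketch, not a benchmark: use_cache is a standard generate() argument, and "gpt2" is again just a placeholder checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")

for use_cache in (True, False):
    torch.cuda.reset_peak_memory_stats()
    # With the cache, each step attends to stored keys/values; without it,
    # every step recomputes attention over the whole sequence so far.
    model.generate(**inputs, max_new_tokens=100, use_cache=use_cache)
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"use_cache={use_cache}: peak allocated = {peak:.0f} MiB")
```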

Apart from that, if you disabled the cache manually, the memory pattern can be a result of how CUDA memory management works. Sometimes, for some models, CUDA might require more and more memory with each new generation. Since each generation step produces tensors with new shapes (+1 in the sequence-length dimension), some operations can lead to memory fragmentation, and the allocator ends up requesting more memory from the device each time. You can read more about this in the torch docs on "CUDA memory management".
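
Those same docs describe a few knobs that can help when variable-shape workloads fragment the caching allocator. A minimal sketch of the ones I'd try first (whether they help depends on the workload; the env var must be set before the process makes its first CUDA allocation):

```python
import os
# Lets the allocator grow segments instead of carving fixed-size blocks,
# which can reduce fragmentation from ever-changing tensor shapes.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# ... run your generation workload here ...

torch.cuda.empty_cache()            # return unused cached blocks to the driver
print(torch.cuda.memory_summary())  # detailed allocator stats for diagnosing
                                    # fragmentation
```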

