Is there a way to terminate llm.generate and release the GPU memory for the next prompt?

I’m using llm.generate(prompts, sampling_params) to generate text in batch mode. However, sometimes I need to interrupt the current batch generation and release GPU memory before starting the next batch.

The problem is that even after stopping the generation, the old batch still persists in memory, preventing a truly fresh start for the next round. I don’t want to kill the entire LLM instance—just terminate the current batch and reset memory for the next batch efficiently.


I don’t think the Transformers library itself is designed for this. The example below uses a pipeline rather than the model class, but you’ll probably have to manipulate torch directly in a similar way.
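
Something along these lines (a minimal sketch only, assuming a single CUDA device and a Transformers text-generation pipeline; the model name, prompts, and generation arguments are placeholders):

```python
# Minimal sketch: assumes a single CUDA GPU plus the transformers and torch
# packages. Model name, prompts, and generation arguments are placeholders.
import gc
import torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=0)

prompts = ["First prompt", "Second prompt"]
outputs = generator(prompts, max_new_tokens=50)

# Drop the Python references that keep the finished batch's tensors alive,
# then ask torch to release its cached allocations before the next batch.
del outputs
gc.collect()
torch.cuda.empty_cache()

# For a truly fresh start (weights released too), the pipeline itself would
# also have to be deleted before emptying the cache:
# del generator
```

Keep in mind that `torch.cuda.empty_cache()` only returns memory that torch has cached but no longer references; anything still reachable from Python (model weights, cached activations, the old batch's outputs) stays resident until those references are dropped first.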