Is there a way to terminate llm.generate and release the GPU memory for the next prompt?

I don't think the Transformers library itself is designed for this. The example below uses a pipeline rather than a model class, but you'll probably have to manipulate torch directly in a similar way.
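
Here's a minimal sketch of that pattern: drop every Python reference to the pipeline, force garbage collection, then ask PyTorch's caching allocator to release its cached blocks back to the GPU. The model name and prompt are just placeholders for illustration.

```python
import gc
import torch
from transformers import pipeline

# Load a text-generation pipeline on the GPU (model name is just an example).
pipe = pipeline("text-generation", model="gpt2", device=0)

out = pipe("Hello, world", max_new_tokens=20)
print(out[0]["generated_text"])

# Release GPU memory before loading the next model:
# 1. Drop all Python references to the pipeline (and thus its model weights).
del pipe
# 2. Run the garbage collector so the underlying tensors are actually freed.
gc.collect()
# 3. Return freed blocks from PyTorch's caching allocator to the GPU.
torch.cuda.empty_cache()

# Optional sanity check: how much memory torch still has allocated.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```

Note that `torch.cuda.empty_cache()` only releases memory that is no longer referenced, so the `del` and `gc.collect()` steps matter: if anything (a variable, a closure, a cached output) still points at the model, the memory stays allocated.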