Unfreed GPU memory after inference with AutoModelForCausalLM

I am trying to use AutoModelForCausalLM with Facebook's OPT models for inference (see the code below). However, after running it, torch.cuda.memory_reserved() returns 20971520, even though I delete all the variables, run the garbage collector, and call torch.cuda.empty_cache(). This becomes a problem when I run multiple sequential inferences, because at some point I run out of GPU memory.

What could be the cause of this potential memory leak?

import gc
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "A monolithic operating system differs"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.cuda.memory_reserved()) # 0

checkpoint = "facebook/opt-6.7b"
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(torch_device)

with torch.no_grad():
    output = model.generate(inputs)

del prompt, torch_device, checkpoint, model, tokenizer, inputs, output
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved()) # 20971520
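
For reference, torch.cuda.memory_allocated() reports the bytes held by live tensors, while torch.cuda.memory_reserved() also counts blocks that PyTorch's caching allocator keeps around for reuse. A quick check with the same setup as above shows whether any tensors actually survive the cleanup:

print(torch.cuda.memory_allocated())  # bytes held by live tensors
print(torch.cuda.memory_reserved())   # bytes kept by the caching allocator, even if unused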

If you still need to ensure that the memory is freed before the next operation, you can try explicitly unloading the model from memory and creating a new one before each run. For example:

# Clear memory before next output
model = None
tokenizer = None
gc.collect()
torch.cuda.empty_cache()

# Load model before next output
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
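
For multiple sequential inferences, the same clear-and-reload pattern can be wrapped in a helper function. Below is a minimal sketch under the same setup as the question; the run_inference name is just for illustration:

import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-6.7b"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

def run_inference(prompt):
    # Load a fresh model and tokenizer for this run
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(torch_device)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(torch_device)
    with torch.no_grad():
        output_ids = model.generate(inputs)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Drop every reference before clearing the CUDA cache
    del model, tokenizer, inputs, output_ids
    gc.collect()
    torch.cuda.empty_cache()
    return text

for prompt in ["A monolithic operating system differs",
               "A microkernel, by contrast,"]:
    print(run_inference(prompt))
    print(torch.cuda.memory_reserved())

Note that reloading a 6.7B-parameter checkpoint on every call is slow; if the reserved cache is the only concern, keeping the model loaded and deleting only the per-run tensors is usually enough, and a full reload is needed only when the model itself must be evicted between runs.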