I am trying to use AutoModelForCausalLM with Facebook's OPT models for inference (like in the code below). However, after the run finishes, torch.cuda.memory_reserved() still returns 20971520 (20 MiB), despite deleting all the variables, running the garbage collector and calling torch.cuda.empty_cache(). This becomes an issue when I run multiple sequential inferences, because at some point I run out of GPU memory.
What could be the cause of this potential memory leak?
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
prompt = "A monolithic operating system differs"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.cuda.memory_reserved()) # 0
checkpoint = "facebook/opt-6.7b"
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(torch_device)
with torch.no_grad():
    output = model.generate(inputs)
# Try to release everything: drop all references, run the garbage collector, clear the CUDA cache
del prompt, torch_device, checkpoint, model, tokenizer, inputs, output
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved()) # 20971520
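For context, the "multiple sequential inferences" I mean look roughly like the sketch below. This is a simplified illustration, not my exact setup: the list of checkpoints, the single shared prompt and the decode step are placeholders, and the cleanup between iterations is the same as in the snippet above.

import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
prompt = "A monolithic operating system differs"
checkpoints = ["facebook/opt-6.7b", "facebook/opt-2.7b"]  # placeholder list of models

for checkpoint in checkpoints:
    # Load one model, run a single generation on the GPU
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(torch_device)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(torch_device)
    with torch.no_grad():
        output = model.generate(inputs)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

    # Same cleanup as above between models
    del model, tokenizer, inputs, output
    gc.collect()
    torch.cuda.empty_cache()
    # This is where I watch the reserved memory fail to drop back to zero
    print(torch.cuda.memory_reserved())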