The problem here is that on the first call to the function that implements the above lines the memory is released, but from the second call onwards it is not, as can be seen from the screenshot:
The first peak is the initial call to the function; on return the memory is freed, but from the second call onwards it is not, which eventually leads to a crash.
A memory profiler suggests that the line model = AutoModelForTokenClassification.from_pretrained("./modelfiles") is the problem, as can be seen from the screenshot below:
Thank you for your answer, but I managed to solve this problem by not loading the model on every call and instead keeping a global/app-level reference to the model.
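In case it helps anyone else, here is a minimal sketch of that approach (the predict function and the tokenizer usage are illustrative, not the exact code; only the "./modelfiles" path is from the original question):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load once at module/app startup and reuse the same objects on every call,
# instead of calling from_pretrained() inside the request handler.
MODEL = AutoModelForTokenClassification.from_pretrained("./modelfiles")
TOKENIZER = AutoTokenizer.from_pretrained("./modelfiles")

def predict(text):
    inputs = TOKENIZER(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no autograd buffers kept around
        outputs = MODEL(**inputs)
    return outputs.logits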
I can’t free the GPU memory. Here’s a minimal example:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6"  # expose only one physical GPU to this process

import torch
from transformers import LlamaForCausalLM

# Load the model in 8-bit onto GPU 0 (of the visible devices).
model = LlamaForCausalLM.from_pretrained(
    pretrained_model_name_or_path='decapoda-research/llama-7b-hf',
    load_in_8bit=True,
    device_map={'': 0},
)

# Try to release the GPU memory.
del model
torch.cuda.empty_cache()
print('breakpoint here - is memory freed?')
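A small diagnostic sketch that may help narrow this down (not a confirmed fix): force garbage collection before empty_cache(), then compare how much memory is still held by live tensors versus merely cached by the allocator. If the "allocated" number stays high, something (for example a reference created during 8-bit loading) still points at the weights, and empty_cache() cannot release them.

import gc
import torch

gc.collect()                      # drop lingering Python references first
torch.cuda.empty_cache()          # then release cached blocks back to the driver
print(f"allocated: {torch.cuda.memory_allocated(0)} bytes")  # memory held by live tensors
print(f"reserved:  {torch.cuda.memory_reserved(0)} bytes")   # memory kept by the caching allocator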