I am using a transformers pipeline for token classification:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained("./modelfiles")
model = AutoModelForTokenClassification.from_pretrained("./modelfiles")
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
ner_results = nlp(text)
The problem is that the memory is released after the first call to the function that runs the lines above, but from the second call onwards it is not released, as can be seen from the screenshot:
The first peak is my first call to the function; on return it frees the memory. From the second call onwards it does not, which eventually leads to a crash.
A memory profiler points to the line model = AutoModelForTokenClassification.from_pretrained("./modelfiles") as the problem, as can be seen from the screenshot below:
I have tried setting model = None before the return statement and also calling gc.collect(), but the problem persists.
Can someone please help me with this? It always leads to a crash of the application.
Sorry for the bump, but I would appreciate some help on this, as I just can't find the source of the problem.
I faced the same issue and was able to avoid it by defining a class that inherits from the BertModel class instead of immediately using
Thank you for your answer, but I managed to solve this problem by not loading the model on every call and instead keeping a global/app-level reference to the model.
More details and minimal solution code posted in my issue here: Transformers model inference via pipeline not releasing memory after 2nd call. Leads to memory leak and crash in Flask web app · Issue #20594 · huggingface/transformers · GitHub
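A minimal sketch of the load-once approach described above (not the code from the linked issue). The helper names get_ner_pipeline and extract_entities are hypothetical; the model path and pipeline arguments are taken from the original post. It assumes the pipeline object can safely be reused across calls in your app.

```python
from functools import lru_cache

MODEL_DIR = "./modelfiles"  # path used in the original post


@lru_cache(maxsize=1)
def get_ner_pipeline():
    """Build the pipeline on first use; later calls return the cached object."""
    # Imported lazily so that importing this module does not load transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def extract_entities(text):
    """Hypothetical request handler body: reuses the single cached pipeline."""
    return get_ner_pipeline()(text)
```

Because from_pretrained now runs only once per process, repeated calls no longer allocate a fresh copy of the model each time.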
I can’t free the GPU memory. Here’s a minimal example:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(
print('breakpoint here - is memory freed?')
I am facing the same problem, have you solved it?
Yes. Run CPU garbage collection before freeing the CUDA memory.
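A rough sketch of that order of operations, assuming PyTorch is the backend. release_gpu_memory is a hypothetical helper name; gc.collect() and torch.cuda.empty_cache() are the standard Python and PyTorch calls. Note that empty_cache() only returns cached blocks that are no longer in use, so every reference to the model has to be dropped first.

```python
import gc


def release_gpu_memory():
    """Collect unreachable Python objects, then empty PyTorch's CUDA cache.

    Returns True if the CUDA cache was emptied, False otherwise.
    """
    # Step 1: CPU-side garbage collection, so tensors that are only kept
    # alive by reference cycles are actually destroyed.
    gc.collect()

    # Step 2: ask PyTorch to hand its now-unused cached blocks back to the driver.
    try:
        import torch
    except ImportError:
        return False  # PyTorch not installed; nothing more to do
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


# Usage, in the scope that owns the model:
#   del model              # drop the last reference first
#   release_gpu_memory()
```

If memory is still held after this, some other object (an optimizer, a pipeline, a stored output tensor) is probably still referencing the model's parameters.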