AutoModel from_pretrained not releasing memory and causing a memory leak

Hi,

I am using the transformers pipeline for token classification:

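    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

    # Load the tokenizer and model from the local directory, then run NER on `text`
    tokenizer = AutoTokenizer.from_pretrained("./modelfiles")
    model = AutoModelForTokenClassification.from_pretrained("./modelfiles")
    nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    ner_results = nlp(text)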

The problem is that memory is released after the first call to the function that runs the lines above, but from the second call onwards it is not released, as the screenshot shows:

The first peak is the first call to the function; the memory is freed when it returns. From the second call onwards the memory is not freed, which eventually leads to a crash.

A memory profiler points to the line model = AutoModelForTokenClassification.from_pretrained("./modelfiles") as the source of the problem, as the screenshot below shows:

I have tried setting model = None before the return statement and calling gc.collect(), but the problem persists.

Can someone please help me with this? It always ends with the application crashing.

Thank you

Sorry for the bump, but I would appreciate some help with this, as I just can't find the source of the problem.

I faced the same issue and was able to avoid it by defining a class that inherits from BertModel instead of using AutoModel directly.

Thank you for your answer, but I managed to solve this problem by not loading the model on every call and keeping a global/app-level reference to the model instead.

More details and minimal solution code are posted in my issue here: Transformers model inference via pipeline not releasing memory after 2nd call. Leads to memory leak and crash in Flask web app · Issue #20594 · huggingface/transformers · GitHub
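For readers who don't want to click through, here is a minimal sketch of the "load once, reuse on every call" idea described above. It is not the code from the linked issue. The names get_pipeline and extract_entities are hypothetical, and using functools.lru_cache to hold the single app-level reference is just one way to do it; a plain module-level variable would work as well.

```python
from functools import lru_cache

MODEL_DIR = "./modelfiles"  # local model directory from the original post


@lru_cache(maxsize=1)
def get_pipeline():
    """Build the pipeline on first use and return the same object afterwards."""
    # Imported lazily so that importing this module stays cheap.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def extract_entities(text):
    """Hypothetical request handler body: reuses the cached pipeline."""
    return get_pipeline()(text)
```

With this, a web handler calls extract_entities(text) on each request, and from_pretrained only runs the first time, so the model is no longer reloaded per call.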

I can’t free the GPU memory. Here’s a minimal example:


    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "6"
    import torch
    from transformers import LlamaForCausalLM

    model = LlamaForCausalLM.from_pretrained(
        pretrained_model_name_or_path='decapoda-research/llama-7b-hf',
        load_in_8bit=True,
        device_map={'': 0},
    )

    del model
    torch.cuda.empty_cache()

    print('breakpoint here - is memory freed?')
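
One thing that may be worth checking at that breakpoint, offered as a sketch rather than a confirmed fix: torch.cuda.empty_cache() only returns cached blocks that no live tensor is using, so running gc.collect() first gives Python a chance to drop any lingering references to the model. The helper name free_cuda_cache is hypothetical. Also note that the CUDA context itself keeps some GPU memory for as long as the process lives, so nvidia-smi will likely not drop to zero even when this works.

```python
import gc


def free_cuda_cache():
    """Collect garbage, then release PyTorch's unused cached GPU memory.

    Returns (allocated_bytes, reserved_bytes) for the current device,
    or None when torch or a CUDA device is not available.
    """
    # Drop unreachable Python objects (including reference cycles) first,
    # so the tensors they hold are actually freed.
    gc.collect()

    try:
        import torch
    except ImportError:
        return None

    if not torch.cuda.is_available():
        return None

    # Hand unused cached blocks back to the driver.
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
```

Call it right after del model. If memory_allocated() is still large afterwards, something is probably still holding a reference to the model or its tensors.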