The problem here is that on the first call to the function that implements the above lines the memory is released, but from the second call onwards it is not, as can be seen from the screenshot:
The first peak is the initial call to the function; on return the memory is freed, but from the second call onwards it is not, which eventually leads to a crash.
A memory profiler suggests that the line model = AutoModelForTokenClassification.from_pretrained("./modelfiles") is the problem, as can be seen from the screenshot below:
Thank you for your answer, but I managed to solve this problem by not loading the model on every call and instead keeping a global/app-level reference to the model.
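In case it helps anyone else, here is a minimal sketch of that approach (the predict function and the tokenizer usage are illustrative, not the exact code; only the "./modelfiles" path is from the original question):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load once at module/app startup and reuse the same objects on every call,
# instead of calling from_pretrained() inside the request handler.
MODEL = AutoModelForTokenClassification.from_pretrained("./modelfiles")
TOKENIZER = AutoTokenizer.from_pretrained("./modelfiles")

def predict(text):
    inputs = TOKENIZER(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no autograd buffers kept around
        outputs = MODEL(**inputs)
    return outputs.logits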
I can’t free the GPU memory. Here’s a minimal example:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6"  # expose only one physical GPU to this process

import torch
from transformers import LlamaForCausalLM

# Load the model in 8-bit onto GPU 0 (of the visible devices).
model = LlamaForCausalLM.from_pretrained(
    pretrained_model_name_or_path='decapoda-research/llama-7b-hf',
    load_in_8bit=True,
    device_map={'': 0},
)

# Try to release the GPU memory.
del model
torch.cuda.empty_cache()
print('breakpoint here - is memory freed?')
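A small diagnostic sketch that may help narrow this down (not a confirmed fix): force garbage collection before empty_cache(), then compare how much memory is still held by live tensors versus merely cached by the allocator. If the "allocated" number stays high, something (for example a reference created during 8-bit loading) still points at the weights, and empty_cache() cannot release them.

import gc
import torch

gc.collect()                      # drop lingering Python references first
torch.cuda.empty_cache()          # then release cached blocks back to the driver
print(f"allocated: {torch.cuda.memory_allocated(0)} bytes")  # memory held by live tensors
print(f"reserved:  {torch.cuda.memory_reserved(0)} bytes")   # memory kept by the caching allocator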