I am currently using the transformers pipeline to deploy a speech-to-text model. The model I am using is distil-whisper/distil-small.en.
I am doing this to build a live speech-to-text engine, and I am deploying the server on a GPU.
The issue I am facing on the GPU is that the RAM usage keeps increasing and is never freed.
While debugging, I tracked the issue down to the transcription step: every time the audio data is transcribed, memory usage goes up and is not released after the transcription finishes.
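To be clear about what I mean by "RAM usage": memory_profiler reports the host RSS of the process, and that is the number that keeps growing. A helper like the one below is the kind of check I mean (just a sketch for illustration; psutil and the log_memory name are not part of my actual code):

import psutil
import torch

def log_memory(tag: str) -> None:
    # Host RSS of the server process, in MB (this is the number that keeps growing)
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    # CUDA allocator stats, to tell GPU memory apart from host RAM
    alloc_mb = torch.cuda.memory_allocated() / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: rss={rss_mb:.1f} MB, cuda_allocated={alloc_mb:.1f} MB, cuda_reserved={reserved_mb:.1f} MB")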
Below is the sample code I am using to load the model and build the pipeline:
import logging

import torch
from memory_profiler import memory_usage
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

logger = logging.getLogger(__name__)

def __init__(self) -> None:
    # logger.info(f"Loading model {self.model}")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float32
    logger.info(torch_dtype)
    logger.info(device)

    model_id = "distil-whisper/distil-small.en"
    # model_id = "openai/whisper-tiny.en"

    # Load the model weights and move them onto the GPU
    self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=False,
        use_safetensors=True,
        use_cache=False,
    )
    self.model.to(device)

    self.processor = AutoProcessor.from_pretrained(model_id)

    # Build the ASR pipeline around the loaded model and processor
    self.transcriber = pipeline(
        "automatic-speech-recognition",
        model=self.model,
        tokenizer=self.processor.tokenizer,
        feature_extractor=self.processor.feature_extractor,
        max_new_tokens=128,
        use_fast=False,
        chunk_length_s=10,
        batch_size=8,
        torch_dtype=torch_dtype,
        device=device,
    )
    logger.info("Model loaded")
And this is the relevant part of the transcription method where the growth shows up:

mem_usage = memory_usage(max_usage=True)  # max host memory usage in MB (memory_profiler)
logger.info(f"Current memory usage in transcriber before generating output: {mem_usage} MB")

text = self.transcriber(arr)

# Clean up
del arr, audio_data
torch.cuda.empty_cache()

mem_usage = memory_usage(max_usage=True)  # max host memory usage in MB
logger.info(f"Current memory usage in transcriber after generating output: {mem_usage} MB")
In the above code there is an increment of around 40 MB of memory every time a transcription runs, and it keeps accumulating; sometimes the increase is around 200 MB.
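For reference, this is the kind of loop that reproduces the per-call increment described above (a sketch only; the random-noise audio and the iteration count are just for illustration):

import numpy as np

sr = 16000
# 10 seconds of random noise stands in for a real audio chunk
dummy_audio = np.random.randn(10 * sr).astype(np.float32)

for i in range(20):
    # self.transcriber is the ASR pipeline built in __init__ above
    _ = self.transcriber({"raw": dummy_audio, "sampling_rate": sr})
    torch.cuda.empty_cache()
    logger.info(f"iteration {i}: host memory {memory_usage(max_usage=True)} MB")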
If someone has any idea what is causing this, please help.