Continuous increase in memory usage

I am currently using the transformers pipeline to deploy a speech-to-text model.
The model I am using is distil-whisper/distil-small.en.

I am building a live speech-to-text engine and deploying the server on a GPU.

The issue I am facing on the GPU is that RAM usage continuously increases and never clears.
While debugging, I tracked it down to the transcription step: each time I transcribe audio data, memory usage grows and is not freed after the transcription.

Below is the sample code I am using to load the model and create the pipeline:
def __init__(self) -> None:
    # logger.info(f"Loading model {self.model}")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float32
    logger.info(torch_dtype)
    logger.info(device)
    model_id = "distil-whisper/distil-small.en"
    # model_id = "openai/whisper-tiny.en"
    self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=False,
        use_safetensors=True,
        use_cache=False
    )
    self.model.to(device)

    self.processor = AutoProcessor.from_pretrained(model_id)
    self.transcriber = pipeline(
        "automatic-speech-recognition",
        model=self.model,
        tokenizer=self.processor.tokenizer,
        feature_extractor=self.processor.feature_extractor,
        max_new_tokens=128,
        use_fast=False,
        chunk_length_s=10,
        batch_size=8,
        torch_dtype=torch_dtype,
        device=device
    )
    logger.info("Model loaded")

# Inside the transcribe method; memory_usage is from memory_profiler,
# and arr is the audio array derived from audio_data
mem_usage = memory_usage(max_usage=True)  # Get max memory usage in MB
logger.info(f"Current memory usage in transcriber before generating output: {mem_usage} MB")

text = self.transcriber(arr)

# Clean up
del arr, audio_data
torch.cuda.empty_cache()

mem_usage = memory_usage(max_usage=True)  # Get max memory usage in MB
logger.info(f"Current memory usage in transcriber after generating output: {mem_usage} MB")

In the above code there is an increment of around 40 MB of memory every time a transcription runs, and it keeps adding up; sometimes the increase is around 200 MB.
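
To check whether the growth is in host RAM or on the GPU, something like this can be logged around each call (a rough sketch; psutil is assumed to be installed, and transcriber / arr stand in for the pipeline and audio array above):

import os
import psutil
import torch

proc = psutil.Process(os.getpid())

def log_mem(tag):
    # Host-side resident set size vs. memory actually allocated by PyTorch on the GPU
    rss_mb = proc.memory_info().rss / 1024 ** 2
    cuda_mb = torch.cuda.memory_allocated() / 1024 ** 2 if torch.cuda.is_available() else 0.0
    print(f"{tag}: RSS {rss_mb:.1f} MB, CUDA allocated {cuda_mb:.1f} MB")

log_mem("before transcribe")
text = transcriber(arr)   # placeholder for self.transcriber(arr)
log_mem("after transcribe")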

If someone has any idea about this, please help.


This will solve the problem in simple cases, but I can’t say for sure whether it will work in your case without reading your code carefully.

import gc
del arr, audio_data
gc.collect()

Actually, I have tried the above approach, but it is not freeing the memory.


If that’s the case, the problem is either in the loop outside the transcribe function, or is it this…?

arr.to("cpu") # offload before deleting
del arr, audio_data
gc.collect()
torch.cuda.empty_cache()

Actually, I don’t think the issue is with that; the lines below are in a different function, let’s say:
def transcribe(self, audio_data):
    mem_usage = memory_usage(max_usage=True)  # Get max memory usage in MB
    logger.info(f"Current memory usage in transcriber before generating output: {mem_usage} MB")

    text = self.transcriber(arr)

    # Clean up
    del arr, audio_data
    torch.cuda.empty_cache()

    mem_usage = memory_usage(max_usage=True)  # Get max memory usage in MB
    logger.info(f"Current memory usage in transcriber after generating output: {mem_usage} MB")

This self.transcriber call is what consumes it; the significant increase in memory happens between those two logging lines.

And self.transcriber is basically the Hugging Face pipeline, i.e.:

self.transcriber = pipeline(
    "automatic-speech-recognition",
    model=self.model,
    tokenizer=self.processor.tokenizer,
    feature_extractor=self.processor.feature_extractor,
    max_new_tokens=128,
    use_fast=False,
    chunk_length_s=10,
    batch_size=8,
    torch_dtype=torch_dtype,
    device=device
)

So is the pipeline using some memory and not clearing it? I am not quite able to figure out the issue here.

When I ran the same code on Colab as well, transcribing an audio file continuously using the same pipeline, the RAM usage there was also increasing slowly and was not being freed.


batch_size=8,

I see, so the pipeline is suspect. Then this is about the only other thing that looks suspicious: batching increases the memory required, and it is also simply prone to bugs.
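
For example, one quick test (a rough sketch; wav stands in for your audio array, and batch_size can also be overridden per call):

from memory_profiler import memory_usage

# Run the same clip repeatedly with batch_size=1 and see whether memory
# still grows; if it does, batching is not the culprit.
for i in range(20):
    _ = transcriber(wav, batch_size=1)   # overrides the batch_size=8 set at construction
    print(f"iteration {i}: max RSS {memory_usage(max_usage=True)} MB")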

Thanks for the quick response. I will check the above-mentioned pieces and update soon.


@John6666 I tried changing the batch size to 1, along with multiple other variations, but no luck; the CPU memory is still increasing.


Oh… what the heck

Whisper problem?

Oh, so that’s the problem. Thank you so much, man; I will look for some solution to clear the model’s memory.


@John6666 I tried different approaches. Do you have any idea how I can clear the memory used by the model, by deleting or offloading it?

I tried the approaches mentioned in the above link, but none of them worked for me.


The only way to do this from Python is to offload the torch model and tensors to the CPU from as appropriate a scope as possible, delete the objects themselves explicitly, and then call gc.collect() and torch.cuda.empty_cache() after making sure the tensors are not referenced from anywhere. Be careful: there are cases where tqdm and other such tools implicitly hold references to them.
In other words, your current approach is already the right one. If it doesn’t work, something is wrong, and you should suspect a bug or a problem in the library.
Another method is to move the model execution into a separate script and run it in a subprocess. That way the OS manages the memory, which is more forceful than anything you can do from Python. However, it is not clean and it takes time.
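
A rough sketch of both suggestions, assuming the model, processor, and pipeline are attributes on self as in the code earlier in the thread (the unload method and the transcribe_worker.py script name are just illustrative):

import gc
import subprocess
import torch

# 1) Explicit teardown: offload, drop every reference, then collect and empty the cache.
def unload(self):
    if self.model is not None:
        self.model.to("cpu")          # offload weights before dropping the reference
    self.transcriber = None
    self.model = None
    self.processor = None
    gc.collect()                      # make sure nothing still references the tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached CUDA blocks to the driver

# 2) Process isolation: run the transcription in a worker script so the OS
#    reclaims all memory when the process exits.
result = subprocess.run(
    ["python", "transcribe_worker.py", "--audio", "clip.wav"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)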

@not-lain This could be a tricky VRAM leak problem.
