I am currently using the transformers pipeline to deploy a speech-to-text model. The model I am using is distil-whisper/distil-small.en.
I am doing this to build a live speech-to-text engine, and I am deploying the server on a GPU.
The issue I am facing on the GPU is that the RAM usage keeps increasing and is never freed.
While debugging, I tracked the issue down to the transcription step: every time the audio data is transcribed, memory usage goes up and is not released after the transcription finishes.
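To be clear about what I mean by "RAM usage": memory_profiler reports the host RSS of the process, and that is the number that keeps growing. A helper like the one below is the kind of check I mean (just a sketch for illustration; psutil and the log_memory name are not part of my actual code):

import psutil
import torch

def log_memory(tag: str) -> None:
    # Host RSS of the server process, in MB (this is the number that keeps growing)
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    # CUDA allocator stats, to tell GPU memory apart from host RAM
    alloc_mb = torch.cuda.memory_allocated() / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: rss={rss_mb:.1f} MB, cuda_allocated={alloc_mb:.1f} MB, cuda_reserved={reserved_mb:.1f} MB")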
Below is the sample code I am using to load the model and build the pipeline:
import logging

import torch
from memory_profiler import memory_usage
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

logger = logging.getLogger(__name__)

def __init__(self) -> None:
    # logger.info(f"Loading model {self.model}")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float32
    logger.info(torch_dtype)
    logger.info(device)

    model_id = "distil-whisper/distil-small.en"
    # model_id = "openai/whisper-tiny.en"

    # Load the model weights and move them onto the GPU
    self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=False,
        use_safetensors=True,
        use_cache=False,
    )
    self.model.to(device)

    self.processor = AutoProcessor.from_pretrained(model_id)

    # Build the ASR pipeline around the loaded model and processor
    self.transcriber = pipeline(
        "automatic-speech-recognition",
        model=self.model,
        tokenizer=self.processor.tokenizer,
        feature_extractor=self.processor.feature_extractor,
        max_new_tokens=128,
        use_fast=False,
        chunk_length_s=10,
        batch_size=8,
        torch_dtype=torch_dtype,
        device=device,
    )
    logger.info("Model loaded")
And this is the relevant part of the transcription method where the growth shows up:

mem_usage = memory_usage(max_usage=True)  # max host memory usage in MB (memory_profiler)
logger.info(f"Current memory usage in transcriber before generating output: {mem_usage} MB")

text = self.transcriber(arr)

# Clean up
del arr, audio_data
torch.cuda.empty_cache()

mem_usage = memory_usage(max_usage=True)  # max host memory usage in MB
logger.info(f"Current memory usage in transcriber after generating output: {mem_usage} MB")
In the above code there is an increment of around 40 MB of memory every time a transcription runs, and it keeps accumulating; sometimes the increase is around 200 MB.
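For reference, this is the kind of loop that reproduces the per-call increment described above (a sketch only; the random-noise audio and the iteration count are just for illustration):

import numpy as np

sr = 16000
# 10 seconds of random noise stands in for a real audio chunk
dummy_audio = np.random.randn(10 * sr).astype(np.float32)

for i in range(20):
    # self.transcriber is the ASR pipeline built in __init__ above
    _ = self.transcriber({"raw": dummy_audio, "sampling_rate": sr})
    torch.cuda.empty_cache()
    logger.info(f"iteration {i}: host memory {memory_usage(max_usage=True)} MB")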
If someone has any idea what is causing this, please help.