Deploying Whisper-Based Live Transcription for 1,000 Concurrent Users

Hi everyone,

I want to use the Whisper medium or large model to run live inference via WebSockets and make it available to 1,000+ concurrent users.

Are there any resources on how I might do this efficiently on GPUs? What are some ways to incorporate dynamic batching, and how can the Hugging Face libraries help with that? I'm currently using Whisper Streaming, but it isn't designed to scale beyond 2-3 concurrent requests on a single GPU.
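To make the question concrete, here is roughly the kind of dynamic batching I have in mind: every WebSocket connection pushes its audio chunks into one shared queue, a background task drains the queue for a short window, and the whole batch goes through the model in a single forward pass. This is just a sketch of the idea, not working code; `MAX_BATCH`, `BATCH_WINDOW_MS`, and using the `transformers` ASR pipeline are placeholder choices on my part, not anything Whisper Streaming provides.

```python
import asyncio

import numpy as np
from transformers import pipeline

MAX_BATCH = 16        # placeholder: max chunks per GPU forward pass
BATCH_WINDOW_MS = 50  # placeholder: how long to wait while filling a batch

# Whisper via the HF ASR pipeline; expects 16 kHz float32 mono audio.
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-medium", device=0)

# Each queued item is (audio_chunk, future that resolves to the transcript).
queue: asyncio.Queue = asyncio.Queue()


async def batcher() -> None:
    """Drain the shared queue and run one batched forward pass per window."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until at least one request
        try:
            # Keep collecting for a short window to build a bigger batch.
            while len(batch) < MAX_BATCH:
                batch.append(await asyncio.wait_for(
                    queue.get(), timeout=BATCH_WINDOW_MS / 1000))
        except asyncio.TimeoutError:
            pass
        audio = [chunk for chunk, _ in batch]
        # One GPU call for the whole batch instead of one per connection;
        # run it in a worker thread so the event loop keeps accepting audio.
        results = await loop.run_in_executor(
            None, lambda: asr(audio, batch_size=len(audio)))
        for (_, fut), result in zip(batch, results):
            fut.set_result(result["text"])


async def transcribe_chunk(chunk: np.ndarray) -> str:
    """Called by each WebSocket handler; awaits the batched transcript."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((chunk, fut))
    return await fut
```

Each connection handler would just `await transcribe_chunk(chunk)` for every incoming frame, with the batcher started once via `asyncio.create_task(batcher())`. Is this roughly the right shape, and is there existing HF tooling that already handles this?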
