Deploying Whisper-Based Live Transcription for 1,000 Concurrent Users

Hi everyone,

I want to use the Whisper medium or large model to run live inference via WebSockets and make it available to 1,000+ concurrent users.

Are there any resources on how I might do this efficiently on GPUs? What are some ways to incorporate dynamic batching, and how can the Hugging Face libraries help with that? I'm currently using Whisper Streaming, but it isn't designed to scale beyond 2-3 concurrent requests on a single GPU.
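To make the question concrete, here is roughly the kind of dynamic batching I have in mind: every WebSocket connection pushes its audio chunks into one shared queue, a background task drains the queue for a short window, and the whole batch goes through the model in a single forward pass. This is just a sketch of the idea, not working code; `MAX_BATCH`, `BATCH_WINDOW_MS`, and using the `transformers` ASR pipeline are placeholder choices on my part, not anything Whisper Streaming provides.

```python
import asyncio

import numpy as np
from transformers import pipeline

MAX_BATCH = 16        # placeholder: max chunks per GPU forward pass
BATCH_WINDOW_MS = 50  # placeholder: how long to wait while filling a batch

# Whisper via the HF ASR pipeline; expects 16 kHz float32 mono audio.
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-medium", device=0)

# Each queued item is (audio_chunk, future that resolves to the transcript).
queue: asyncio.Queue = asyncio.Queue()


async def batcher() -> None:
    """Drain the shared queue and run one batched forward pass per window."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until at least one request
        try:
            # Keep collecting for a short window to build a bigger batch.
            while len(batch) < MAX_BATCH:
                batch.append(await asyncio.wait_for(
                    queue.get(), timeout=BATCH_WINDOW_MS / 1000))
        except asyncio.TimeoutError:
            pass
        audio = [chunk for chunk, _ in batch]
        # One GPU call for the whole batch instead of one per connection;
        # run it in a worker thread so the event loop keeps accepting audio.
        results = await loop.run_in_executor(
            None, lambda: asr(audio, batch_size=len(audio)))
        for (_, fut), result in zip(batch, results):
            fut.set_result(result["text"])


async def transcribe_chunk(chunk: np.ndarray) -> str:
    """Called by each WebSocket handler; awaits the batched transcript."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((chunk, fut))
    return await fut
```

Each connection handler would just `await transcribe_chunk(chunk)` for every incoming frame, with the batcher started once via `asyncio.create_task(batcher())`. Is this roughly the right shape, and is there existing HF tooling that already handles this?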
