I’m deep-diving into the Whisper implementation in the transformers library. In my application, speed is crucial, so I’m wondering what I can do to speed up inference.
I’m using batched inference, but hallucination makes generation a lot slower: when a sample inside the batch hallucinates with a token repetition, the whole batch takes a lot of extra time to finish.
I’m using a custom streamer to return finished predictions as soon as the EOS token is predicted, but I’m wondering if I can also drop those samples from the batch to speed up inference for the remaining ones, roughly like the sketch below. Does that make sense to you? Should I expect a speed improvement with this approach?
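To make the idea concrete, here is a rough sketch of what I have in mind (greedy decoding only, no KV cache, and I’m skipping Whisper’s forced language/task tokens for brevity, so it’s a proof of concept rather than what I actually run):

```python
import torch
from transformers.modeling_outputs import BaseModelOutput

# Sketch only: greedy decoding that removes rows from the batch as soon as they
# emit EOS, so the remaining steps run on a smaller batch. KV-cache handling is
# omitted for clarity (slicing past_key_values per row is the fiddly part in
# practice), and the forced language/task tokens are skipped.
@torch.no_grad()
def generate_with_batch_dropout(model, input_features, max_new_tokens=128):
    device = input_features.device
    start_id = model.config.decoder_start_token_id
    eos_id = model.config.eos_token_id

    # Encode once; keep the encoder states alongside the decoder ids so both
    # can be sliced with the same index when a row finishes.
    enc_hidden = model.get_encoder()(input_features).last_hidden_state  # (B, T, D)
    batch_size = input_features.size(0)
    decoder_ids = torch.full((batch_size, 1), start_id, dtype=torch.long, device=device)

    alive = list(range(batch_size))   # original row index of each row still in the batch
    finished = {}

    for _ in range(max_new_tokens):
        out = model(
            encoder_outputs=BaseModelOutput(last_hidden_state=enc_hidden),
            decoder_input_ids=decoder_ids,
            use_cache=False,
        )
        next_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_ids], dim=-1)

        done = (next_ids.squeeze(-1) == eos_id).tolist()
        if any(done):
            keep = [row for row, d in enumerate(done) if not d]
            for row, d in enumerate(done):
                if d:
                    finished[alive[row]] = decoder_ids[row]
            if not keep:
                break
            idx = torch.tensor(keep, device=device)
            decoder_ids = decoder_ids[idx]    # shrink the running batch
            enc_hidden = enc_hidden[idx]
            alive = [alive[row] for row in keep]

    for row, orig in enumerate(alive):        # rows that hit the token budget
        finished[orig] = decoder_ids[row]
    return [finished[i] for i in range(batch_size)]
```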
Also, can I go further and create a “live batch” approach, where I keep dropping and inserting samples as they finish? For example, with a batch size of 16, if a sample gets an eos_token prediction, I drop it from the batch and insert a new sample in its place, roughly like the second sketch below. Does that make sense?
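This is roughly how I picture the “live batch” version: a fixed number of slots, and whenever a row hits EOS its slot is refilled from a queue of pending audio. Same caveats as above (greedy, no KV cache, forced tokens ignored); since rows now have different lengths I right-pad the decoder ids and read each row’s logits at its own last real position. The function name, the 16 kHz sampling rate, and the rest of the scaffolding are just assumptions for the sketch:

```python
import torch
from collections import deque
from transformers.modeling_outputs import BaseModelOutput

# Sketch only: keep `batch_size` slots busy and refill a slot from the queue as
# soon as its sequence hits EOS. Without KV caching, right-padding is safe here
# because the causal mask stops real tokens from attending to later pad positions,
# and we only read logits at each row's last real token.
@torch.no_grad()
def live_batch_transcribe(model, processor, audios, batch_size=16, max_new_tokens=128):
    device = next(model.parameters()).device
    start_id = model.config.decoder_start_token_id
    eos_id = model.config.eos_token_id
    encoder = model.get_encoder()

    def encode(audio):
        feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
        return encoder(feats.to(device)).last_hidden_state  # (1, T, D)

    queue = deque(enumerate(audios))   # (original index, raw audio)
    active = []                        # dicts: {"idx", "enc", "ids"}
    results = {}

    def refill():
        while len(active) < batch_size and queue:
            idx, audio = queue.popleft()
            active.append({"idx": idx, "enc": encode(audio), "ids": [start_id]})

    refill()
    while active:
        # Right-pad decoder ids so rows of different lengths share one tensor;
        # the padding value is never read by the positions we care about.
        max_len = max(len(r["ids"]) for r in active)
        dec = torch.full((len(active), max_len), eos_id, dtype=torch.long, device=device)
        for row, r in enumerate(active):
            dec[row, : len(r["ids"])] = torch.tensor(r["ids"], device=device)
        enc = BaseModelOutput(last_hidden_state=torch.cat([r["enc"] for r in active], dim=0))

        logits = model(encoder_outputs=enc, decoder_input_ids=dec, use_cache=False).logits
        still_active = []
        for row, r in enumerate(active):
            next_id = int(logits[row, len(r["ids"]) - 1].argmax())
            r["ids"].append(next_id)
            if next_id == eos_id or len(r["ids"]) > max_new_tokens:
                results[r["idx"]] = processor.decode(r["ids"], skip_special_tokens=True)
            else:
                still_active.append(r)
        active = still_active
        refill()   # fill freed slots from the queue

    return [results[i] for i in range(len(audios))]
```

My worry is that a real version would also have to manage past_key_values per slot, which looks like the hard part, so I’d like to know whether this direction is even worth pursuing before going there.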