I’m deep-diving into the Whisper implementation in the transformers library. In my application, speed is crucial, so I’m wondering what I can do to speed up inference.
I’m using batched inference, but hallucination makes generation a lot slower: when a sample inside the batch hallucinates with a token repetition, the whole batch takes a lot of extra time to finish.
I’m using a custom streamer to return finished predictions as soon as the EOS token is predicted, but I’m wondering if I can also drop those samples from the batch to speed up inference for the remaining ones, roughly like the sketch below. Does that make sense to you? Should I expect a speed improvement with this approach?
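To make the idea concrete, here is a rough sketch of what I have in mind (greedy decoding only, no KV cache, and I’m skipping Whisper’s forced language/task tokens for brevity, so it’s a proof of concept rather than what I actually run):

```python
import torch
from transformers.modeling_outputs import BaseModelOutput

# Sketch only: greedy decoding that removes rows from the batch as soon as they
# emit EOS, so the remaining steps run on a smaller batch. KV-cache handling is
# omitted for clarity (slicing past_key_values per row is the fiddly part in
# practice), and the forced language/task tokens are skipped.
@torch.no_grad()
def generate_with_batch_dropout(model, input_features, max_new_tokens=128):
    device = input_features.device
    start_id = model.config.decoder_start_token_id
    eos_id = model.config.eos_token_id

    # Encode once; keep the encoder states alongside the decoder ids so both
    # can be sliced with the same index when a row finishes.
    enc_hidden = model.get_encoder()(input_features).last_hidden_state  # (B, T, D)
    batch_size = input_features.size(0)
    decoder_ids = torch.full((batch_size, 1), start_id, dtype=torch.long, device=device)

    alive = list(range(batch_size))   # original row index of each row still in the batch
    finished = {}

    for _ in range(max_new_tokens):
        out = model(
            encoder_outputs=BaseModelOutput(last_hidden_state=enc_hidden),
            decoder_input_ids=decoder_ids,
            use_cache=False,
        )
        next_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_ids], dim=-1)

        done = (next_ids.squeeze(-1) == eos_id).tolist()
        if any(done):
            keep = [row for row, d in enumerate(done) if not d]
            for row, d in enumerate(done):
                if d:
                    finished[alive[row]] = decoder_ids[row]
            if not keep:
                break
            idx = torch.tensor(keep, device=device)
            decoder_ids = decoder_ids[idx]    # shrink the running batch
            enc_hidden = enc_hidden[idx]
            alive = [alive[row] for row in keep]

    for row, orig in enumerate(alive):        # rows that hit the token budget
        finished[orig] = decoder_ids[row]
    return [finished[i] for i in range(batch_size)]
```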
Also, can I go further and create a “live batch” approach, where I keep dropping and inserting samples as they finish? For example, with a batch size of 16, if a sample gets an eos_token prediction, I drop it from the batch and insert a new sample in its place, roughly like the second sketch below. Does that make sense?
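This is roughly how I picture the “live batch” version: a fixed number of slots, and whenever a row hits EOS its slot is refilled from a queue of pending audio. Same caveats as above (greedy, no KV cache, forced tokens ignored); since rows now have different lengths I right-pad the decoder ids and read each row’s logits at its own last real position. The function name, the 16 kHz sampling rate, and the rest of the scaffolding are just assumptions for the sketch:

```python
import torch
from collections import deque
from transformers.modeling_outputs import BaseModelOutput

# Sketch only: keep `batch_size` slots busy and refill a slot from the queue as
# soon as its sequence hits EOS. Without KV caching, right-padding is safe here
# because the causal mask stops real tokens from attending to later pad positions,
# and we only read logits at each row's last real token.
@torch.no_grad()
def live_batch_transcribe(model, processor, audios, batch_size=16, max_new_tokens=128):
    device = next(model.parameters()).device
    start_id = model.config.decoder_start_token_id
    eos_id = model.config.eos_token_id
    encoder = model.get_encoder()

    def encode(audio):
        feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
        return encoder(feats.to(device)).last_hidden_state  # (1, T, D)

    queue = deque(enumerate(audios))   # (original index, raw audio)
    active = []                        # dicts: {"idx", "enc", "ids"}
    results = {}

    def refill():
        while len(active) < batch_size and queue:
            idx, audio = queue.popleft()
            active.append({"idx": idx, "enc": encode(audio), "ids": [start_id]})

    refill()
    while active:
        # Right-pad decoder ids so rows of different lengths share one tensor;
        # the padding value is never read by the positions we care about.
        max_len = max(len(r["ids"]) for r in active)
        dec = torch.full((len(active), max_len), eos_id, dtype=torch.long, device=device)
        for row, r in enumerate(active):
            dec[row, : len(r["ids"])] = torch.tensor(r["ids"], device=device)
        enc = BaseModelOutput(last_hidden_state=torch.cat([r["enc"] for r in active], dim=0))

        logits = model(encoder_outputs=enc, decoder_input_ids=dec, use_cache=False).logits
        still_active = []
        for row, r in enumerate(active):
            next_id = int(logits[row, len(r["ids"]) - 1].argmax())
            r["ids"].append(next_id)
            if next_id == eos_id or len(r["ids"]) > max_new_tokens:
                results[r["idx"]] = processor.decode(r["ids"], skip_special_tokens=True)
            else:
                still_active.append(r)
        active = still_active
        refill()   # fill freed slots from the queue

    return [results[i] for i in range(len(audios))]
```

My worry is that a real version would also have to manage past_key_values per slot, which looks like the hard part, so I’d like to know whether this direction is even worth pursuing before going there.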