I followed this blog to fine-tune the ASR model.
Training works fine. However, decoding is very slow.
Are there hyperparameters to be optimized for speeding up the decoder of Whisper?
Or is there a possibility to customize the decoder of Whisper?
Seq2Seq models generate text autoregressively with the decoder (see Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers for details). So, we perform as many forward passes of the decoder as there are tokens generated.
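To make the cost concrete, here is a minimal sketch of an autoregressive loop in plain Python. It is a toy, not Whisper and not the 🤗 Transformers API: `ToyDecoder`, `TARGET` and `greedy_generate` are hypothetical names made up for illustration, and the "decoder" just replays a fixed transcript while counting its calls.

```python
# Toy stand-in for a seq2seq decoder. NOT Whisper, NOT the Transformers API.
EOS = "<eos>"
TARGET = ["the", "cat", "sat", "on", "the", "mat", EOS]  # hypothetical transcript


class ToyDecoder:
    """Replays a fixed transcript and counts how often it is called."""

    def __init__(self, target):
        self.target = target
        self.forward_passes = 0

    def forward(self, prefix):
        # One call = one decoder forward pass, conditioned on the tokens so far.
        self.forward_passes += 1
        return self.target[len(prefix)]


def greedy_generate(decoder, max_length):
    """Generate one token per forward pass until EOS or the length cap."""
    tokens = []
    while len(tokens) < max_length:
        next_token = decoder.forward(tokens)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens


# Generous cap: decoding stops on its own when EOS is produced.
full_decoder = ToyDecoder(TARGET)
full = greedy_generate(full_decoder, max_length=100)

# Tight cap: fewer forward passes, but the output is cut short.
capped_decoder = ToyDecoder(TARGET)
capped = greedy_generate(capped_decoder, max_length=3)

print(full, full_decoder.forward_passes)
print(capped, capped_decoder.forward_passes)
```

The number of forward passes tracks the number of generated tokens (plus one pass for the end-of-sequence token when no cap is hit), which is why long transcriptions take proportionally longer to decode.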
Running generation in “greedy” mode will be much faster than beam search (we use greedy by default).
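As a rough illustration of why greedy is cheaper, here is a toy comparison in plain Python. Everything in it is an assumption made for the example: the `PROBS` table is an invented next-token distribution, and `greedy` and `beam_search` are simplified sketches, not what `generate` does internally. It counts one "call" per hypothesis scored; real implementations batch the beams, so the extra cost shows up as more compute per step rather than more sequential steps.

```python
import math

START = "<s>"
# Hypothetical toy "model": next-token probabilities depend only on the last token.
# Only two decoding steps are covered, so "x", "y" and "z" have no entries.
PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "x": 0.1},
}
STEPS = 2


def next_log_probs(last_token, counter):
    """Score the next token for one hypothesis and count the call."""
    counter["calls"] += 1
    return {tok: math.log(p) for tok, p in PROBS[last_token].items()}


def greedy(steps):
    """Keep only the single most likely token at every step."""
    counter = {"calls": 0}
    seq, score = [START], 0.0
    for _ in range(steps):
        log_probs = next_log_probs(seq[-1], counter)
        token = max(log_probs, key=log_probs.get)
        seq.append(token)
        score += log_probs[token]
    return seq[1:], score, counter["calls"]


def beam_search(steps, num_beams):
    """Keep the num_beams best partial sequences at every step."""
    counter = {"calls": 0}
    beams = [([START], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            log_probs = next_log_probs(seq[-1], counter)
            for token, lp in log_probs.items():
                candidates.append((seq + [token], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_seq, best_score = beams[0]
    return best_seq[1:], best_score, counter["calls"]


greedy_seq, greedy_score, greedy_calls = greedy(STEPS)
beam_seq, beam_score, beam_calls = beam_search(STEPS, num_beams=2)

print("greedy:", greedy_seq, greedy_calls)
print("beam:  ", beam_seq, beam_calls)
```

In this toy, beam search finds a higher-scoring sequence than greedy but scores more hypotheses to get there. That is the trade-off: beams can help quality, while greedy keeps the per-step work to a single hypothesis.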
You can also explore reducing the “max_length”:
model.config.max_length = 100
This caps generation at 100 tokens. However, it will almost certainly hurt your transcription accuracy, as some sentences will be cut short.
Are you running inference on GPU? It shouldn’t be too slow with the “small” checkpoint on most GPU devices!
Alternatively, you can try training one of the smaller checkpoints (“base” or “tiny”) for faster inference.
Thanks for your suggestions.
I will try again and come back soon.