Hi there, I have a question about fine-tuning Whisper models and I hope someone can help improve my understanding.
I have fine-tuned two Whisper models (small checkpoint) using the tutorial Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. One model was fine-tuned on a German dataset of ~66 hours of audio and one on a Dutch dataset of ~120 hours. Both datasets have the exact same audio characteristics, and the audio segments are of roughly the same length. I used the exact same fine-tuning parameters for both training runs.
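For reference, both runs shared training arguments along these lines (a minimal sketch following the tutorial; the values shown are the tutorial's defaults, not necessarily a verbatim copy of my config):

```python
# Shared Seq2SeqTrainingArguments for both fine-tuning runs.
# Values are illustrative, taken from the tutorial; only output_dir
# differed between the German and Dutch runs.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
)
```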
However, the German model is considerably faster (and much closer in speed to the out-of-the-box OpenAI model) than the Dutch model. I am running decoding on CPU with faster-whisper, processing multiple files in parallel via multiprocessing, and the gap widens as the workload grows: decoding a single 20-second file takes 46 seconds with the German model and 50 seconds with the Dutch one, but decoding 46 files (~1.45 hours of audio) takes 14 minutes vs. 25 minutes.
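For context, my decoding setup looks roughly like this (a minimal sketch; the model path, language code, file list, and worker count are placeholders):

```python
# Sketch of parallel CPU decoding with faster-whisper: each worker process
# loads its own model instance rather than sharing one across processes.
import multiprocessing as mp
from faster_whisper import WhisperModel

MODEL_PATH = "path/to/ct2-converted-finetuned-model"  # placeholder

_model = None  # one WhisperModel per worker process

def init_worker():
    global _model
    _model = WhisperModel(MODEL_PATH, device="cpu", compute_type="int8")

def transcribe_file(audio_path):
    # transcribe() returns a lazy generator; decoding happens while iterating
    segments, _info = _model.transcribe(audio_path, language="nl")
    return audio_path, " ".join(seg.text for seg in segments)

if __name__ == "__main__":
    files = ["clip_001.wav", "clip_002.wav"]  # placeholder list of input files
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        for path, text in pool.imap_unordered(transcribe_file, files):
            print(path, text[:80])
```

Each worker loads its own model instance because the model object itself cannot be shared across processes.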
My question is: what could be the reasons for this difference?
Are there any factors that have a large impact on inference speed besides the obvious ones, model size and training parameters (both of which I kept identical)? And am I correct in assuming that the sheer amount of training data should not affect inference speed, since it doesn't change the number of parameters?