Different inference speed for finetuned Whisper models

Hi there, I have a question about finetuning Whisper models, and I hope someone can help me understand what is going on.

I have finetuned two Whisper models (small checkpoint) using the tutorial *Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers*. One model was finetuned on a German dataset of ~66 hours of audio and one on a Dutch dataset of ~120 hours. Both datasets have the exact same audio characteristics, and the audio segments are of roughly the same length. I used the exact same finetuning parameters for both training runs.

However, the German model is much faster (and closer to the out-of-the-box OpenAI model) than the Dutch model. I am running decoding on CPU with faster-whisper, processing multiple files in parallel via multiprocessing, and the gap grows as the workload increases: when I decode one file (20 s), the German model takes 46 seconds and the Dutch model 50; when I decode 46 files (1.45 h), it is 14 minutes vs. 25 minutes.
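For context, my decoding setup looks roughly like the sketch below (the model path, file names, and worker count are placeholders, not my exact values): each worker process loads the CTranslate2-converted checkpoint once and then transcribes files from a shared list, and I time the total wall-clock time.

```python
import time
from multiprocessing import Pool

from faster_whisper import WhisperModel

MODEL_DIR = "whisper-small-nl-ct2"  # placeholder path to a converted finetuned checkpoint
AUDIO_FILES = [f"segment_{i:03d}.wav" for i in range(46)]  # placeholder file list

_model = None  # per-worker model instance


def init_worker():
    # Load the model once per worker process instead of once per file.
    global _model
    _model = WhisperModel(MODEL_DIR, device="cpu", compute_type="int8")


def transcribe_file(path):
    segments, _info = _model.transcribe(path)
    # segments is a lazy generator; consuming it forces the actual decoding.
    return " ".join(segment.text for segment in segments)


if __name__ == "__main__":
    start = time.time()
    with Pool(processes=4, initializer=init_worker) as pool:
        transcripts = pool.map(transcribe_file, AUDIO_FILES)
    print(f"Decoded {len(transcripts)} files in {time.time() - start:.1f} s")
```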

My question is: what could be the reasons for these differences?

Are there any factors that have a large impact on inference speed besides the obvious ones, model size and training parameters (which I kept the same in my case)? And am I correct in assuming that the sheer amount of training data should not influence inference speed, since it does not change the number of parameters?
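For what it's worth, this is the kind of sanity check I could run to rule out any difference in model size between the two checkpoints (the paths are just placeholders for my local finetuned models):

```python
from transformers import WhisperForConditionalGeneration

# Hypothetical local paths to the two finetuned checkpoints.
for path in ["whisper-small-de", "whisper-small-nl"]:
    model = WhisperForConditionalGeneration.from_pretrained(path)
    print(path, model.num_parameters())
```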