I have fine-tuned a Whisper-small model following the Hugging Face Audio Course guide "Fine-tuning the ASR model" and have observed that the eval steps are much slower (by a factor of 2-3) than the training steps. This seems strange to me, as I would have thought that an eval step just does a forward pass, which also happens in a training step, so it should not be any slower. I am using the same batch size for training and eval, I am not doing beam search, and the decoding and WER evaluation are not included in the eval step time. While this does not influence the end result, I would still like to understand the training process better.
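For context, a minimal sketch of the kind of setup described above; the argument values are illustrative assumptions modeled on the Audio Course guide, not an exact copy of my script:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # illustrative path
    per_device_train_batch_size=16,  # same batch size for training...
    per_device_eval_batch_size=16,   # ...and for evaluation
    predict_with_generate=True,
    generation_num_beams=1,          # greedy decoding, no beam search
    evaluation_strategy="steps",
    eval_steps=500,
)
```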
I’m facing the same issue when fine-tuning Whisper: the evaluation steps are much slower and the GPU is barely utilized, so I suspect a CPU bottleneck.
Any updates or fixes for this?
I found the reason for the bottleneck.
The slowdown comes from the jiwer library, which is used to calculate the metrics. Although jiwer relies on rapidfuzz, which is written in C++ and is quite fast, the surrounding code is mostly Python and CPU-bound, which bottlenecks the GPU. I worked around this by using eval_loss as my main metric during training and calculating the string metrics offline afterwards, as sketched below.
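A minimal sketch of that workaround, assuming the Trainer setup from the guide (`model`, `train_dataset`, `eval_dataset`, `data_collator`, and `processor` are the objects built there; the argument values are illustrative):

```python
import numpy as np
import jiwer
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # illustrative path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,         # so predict() below returns generated token ids
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # rank checkpoints by loss alone
    greater_is_better=False,            # lower eval_loss is better
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    # no compute_metrics here, so evaluation never calls into jiwer during training
)
trainer.train()

# Compute WER once, offline, after training is done.
pred = trainer.predict(eval_dataset)
# labels are padded with -100, which the tokenizer cannot decode
label_ids = np.where(pred.label_ids == -100,
                     processor.tokenizer.pad_token_id, pred.label_ids)
hypotheses = processor.batch_decode(pred.predictions, skip_special_tokens=True)
references = processor.batch_decode(label_ids, skip_special_tokens=True)
print("WER:", jiwer.wer(references, hypotheses))
```

Checkpoint selection then happens on eval_loss, and the expensive string alignment runs only once at the end instead of on every evaluation.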