Whisper fine-tuning slow eval

I have fine-tuned a Whisper-small model following the guide "Fine-tuning the ASR model" from the Hugging Face Audio Course, and I have observed that the eval steps are much slower (roughly 2-3x) than the training steps. This seems strange to me: I would have thought that an eval step just does a forward pass, which also happens in every training step, so it should not be any slower. I am using the same batch size for training and eval, I am not doing beam search, and the decoding and WER computation are not included in the eval step time. While this does not affect the end result, I would still like to understand the training process better.
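For reference, my training arguments look roughly like the following (a sketch adapted from the Audio Course guide; the exact values are assumptions, but the relevant flags are the matching train/eval batch sizes and greedy decoding):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",
    per_device_train_batch_size=16,   # same batch size used for...
    per_device_eval_batch_size=16,    # ...both training and eval
    predict_with_generate=True,       # eval produces token sequences
    generation_num_beams=1,           # greedy decoding, no beam search
    evaluation_strategy="steps",
    eval_steps=500,
    max_steps=4000,
)
```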