@sanchit-gandhi, thanks for your quick reply. I have resolved that issue, but I want to ask how to fine-tune Whisper on audio longer than 30 seconds. Is this handled internally during training, or do we need to feed in audio that is within the 30-second limit? Thanks.
Hey @Deveshp! Unfortunately, the Whisper model requires every audio sample to be padded or truncated to 30s, so for fine-tuning you need to feed in audio within that limit. See: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
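To make the 30s constraint concrete, here is a minimal pure-Python sketch of what "padded/truncated to 30s" means for a raw waveform. It assumes 16 kHz audio and zero-padding; `pad_or_truncate` and `fits_in_window` are hypothetical helper names for illustration, not functions from 🤗 Transformers (the library's feature extractor handles this step for you).

```python
SAMPLING_RATE = 16_000                      # assumed: 16 kHz mono audio
MAX_SECONDS = 30                            # Whisper's fixed input window
MAX_SAMPLES = SAMPLING_RATE * MAX_SECONDS   # 480_000 samples


def pad_or_truncate(samples, max_samples=MAX_SAMPLES):
    """Return a new list that is exactly `max_samples` long.

    Longer inputs are cut off at the end; shorter inputs are
    right-padded with zeros (silence).
    """
    if len(samples) >= max_samples:
        return list(samples[:max_samples])
    return list(samples) + [0.0] * (max_samples - len(samples))


def fits_in_window(samples, max_samples=MAX_SAMPLES):
    """True if the clip can be used without losing any audio."""
    return len(samples) <= max_samples
```

Note that truncating a long clip drops audio but not the matching part of its transcript, so for fine-tuning it is probably safer to filter out (or split, together with the text) any clip for which `fits_in_window` is false.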
Note though that this requirement shouldn’t stop you from using the Whisper model for transcribing audio samples longer than 30s at inference time! See this Colab for an explanation: Google Colab
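As a rough illustration of how long-form inference can work around the 30s window, the sketch below splits a long recording into overlapping 30s chunks that could each be transcribed separately and then merged. The helper name `chunk_windows` and the 5s overlap are assumptions for illustration; the linked Colab may describe the procedure differently. In 🤗 Transformers, the `automatic-speech-recognition` pipeline exposes a `chunk_length_s` argument that performs chunking of this kind for you, with its own stride handling.

```python
def chunk_windows(num_samples, chunk_len, overlap):
    """Return (start, end) sample indices of overlapping chunks.

    Consecutive chunks share `overlap` samples so that words cut at
    a chunk boundary appear whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if num_samples <= 0:
        return []

    step = chunk_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end >= num_samples:   # last chunk reaches the end of the audio
            break
        start += step
    return windows


# Example: a 70s recording at 16 kHz, 30s chunks, 5s overlap (assumed values)
sr = 16_000
windows = chunk_windows(70 * sr, 30 * sr, 5 * sr)
```

Here `windows` covers 0-30s, 25-55s, and 50-70s; the last chunk is shorter than 30s and would be padded before being passed to the model.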