Whisper fine-tuning on custom audio data

@sanchit-gandhi, thanks for your quick reply. I have resolved that issue, but I want to ask how to fine-tune Whisper with audio clips longer than 30 sec. Is this handled internally during training, or do we need to keep the audio within the 30-sec limit? Thanks.

Hey @Deveshp! Unfortunately, the Whisper model forces the audio samples to be padded/truncated to 30s: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
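
As a quick illustration, here is a minimal sketch of that padding behaviour using the feature extractor (the checkpoint name and the dummy audio are just placeholders):

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# 5 s of dummy 16 kHz audio — shorter than the 30 s window
audio = np.zeros(5 * 16000, dtype=np.float32)

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="np")

# Always (1, 80, 3000): 3000 log-mel frames = 30 s, regardless of input length
print(inputs.input_features.shape)
```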

Note though that this requirement shouldn’t stop you from using the Whisper model for transcribing audio samples longer than 30s at inference time! See this Colab for an explanation: Google Colab
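
For reference, long-form inference works out of the box with the chunked pipeline in 🤗 Transformers. A minimal sketch (the checkpoint and file name are placeholders):

```python
from transformers import pipeline

# chunk_length_s slices the long audio into 30 s windows with striding,
# so recordings of any length can be transcribed
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
)

result = asr("long_recording.mp3")
print(result["text"])
```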

@sanchit-gandhi Do you know if there are any existing pre-processing solutions for chunking long audio + transcript pairs down to <30 sec? I have started on the project below, but if there is already a solution somewhere, it would be a huge help:

I am trying to pre-process a dataset for fine-tuning Whisper. The dataset pairs an mp3 narration with a ground-truth transcript. However, the recordings are far longer than the 30-sec limit for fine-tuning Whisper: many are over 45 min, and the ground-truth transcriptions are NOT timestamped. My current approach is:

  1. Split the audio into chunks of <30 sec at silence points (see the sketch after this list)
  2. Run ASR on each chunk using off-the-shelf whisper-large-v3
  3. Use text similarity to compare each transcription chunk (which contains errors) to the full ground-truth transcription
  4. Where the similarity match overlaps, copy the corresponding text out of the ground-truth transcription and save it as the ground-truth transcription for that chunk
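
For steps 1 and 2, something along these lines should work. This is a sketch, not the exact pipeline: the pydub silence thresholds and file names are assumptions to tune for your data.

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence
from transformers import pipeline

audio = AudioSegment.from_mp3("narration.mp3")

# Step 1: split at silence points (thresholds are guesses — tune per dataset)
raw_chunks = split_on_silence(
    audio,
    min_silence_len=500,             # silence must last at least 500 ms
    silence_thresh=audio.dBFS - 16,  # relative to the recording's loudness
    keep_silence=250,                # retain a little padding at each edge
)

# Re-merge adjacent pieces so each chunk stays under the 30 s limit
# (a single piece longer than 30 s means the silence threshold needs loosening)
chunks, current = [], AudioSegment.empty()
for piece in raw_chunks:
    if len(current) > 0 and len(current) + len(piece) >= 30_000:  # pydub uses ms
        chunks.append(current)
        current = AudioSegment.empty()
    current += piece
if len(current) > 0:
    chunks.append(current)

# Step 2: transcribe each chunk with off-the-shelf whisper-large-v3
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
chunk_texts = []
for i, chunk in enumerate(chunks):
    path = f"chunk_{i:04d}.wav"
    chunk.export(path, format="wav")
    chunk_texts.append(asr(path)["text"])
```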

I have completed steps 1 and 2 and am currently struggling with step 3. I also tried off-the-shelf whisper-large-v3 transcription with timestamps on a full recording, but the timestamps are not correct: they do not align with the audio, so I cannot use them to chunk the audio. I'm currently experimenting with Mistral-7B-Instruct prompts for the similarity step and for creating the corrected chunks.
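
For steps 3 and 4, one lightweight alternative to LLM prompting is a greedy fuzzy alignment with Python's built-in difflib: keep a cursor into the ground-truth word list and, for each ASR chunk in order, search a window around the cursor for the best-matching span. A sketch under the assumption that chunks appear in the same order as the transcript; `slack` is a guess to tune, and `chunk_texts` / `ground_truth_text` are the ASR outputs and full transcript from earlier:

```python
import difflib

def align_chunk(chunk_text, ref_words, cursor, slack=20):
    """Find the span of ref_words starting near `cursor` that best matches
    the (error-prone) ASR text of one chunk; return (span_text, new_cursor)."""
    n = len(chunk_text.split())
    best = (0.0, cursor, cursor)
    # Try start positions near the cursor and span lengths near the chunk length
    for start in range(max(0, cursor - slack), min(len(ref_words), cursor + slack) + 1):
        for length in range(max(1, n - slack), n + slack + 1):
            end = min(start + length, len(ref_words))
            candidate = " ".join(ref_words[start:end])
            ratio = difflib.SequenceMatcher(None, candidate, chunk_text).ratio()
            if ratio > best[0]:
                best = (ratio, start, end)
    _, start, end = best
    return " ".join(ref_words[start:end]), end

# Walk through the chunks in order, advancing the cursor each time
ref_words = ground_truth_text.split()   # full (untimestamped) transcript
cursor, labels = 0, []
for text in chunk_texts:                # ASR output per audio chunk
    span, cursor = align_chunk(text, ref_words, cursor)
    labels.append(span)                 # ground-truth label for this chunk
```

The cursor keeps the search local, so the cost per chunk is only O(slack²) similarity calls rather than a scan over the whole 45-minute transcript.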