@sanchit-gandhi, thanks for your quick reply. I have resolved that issue, but I want to ask how to fine-tune Whisper on audio clips that are longer than 30 seconds. Is this handled internally during training, or do we need to keep the audio within the 30-second limit? Thanks.
Hey @Deveshp! Unfortunately, the Whisper model forces the audio samples to be padded/truncated to 30s: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
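Concretely, the feature extractor always returns a fixed-size log-mel spectrogram covering 30s of audio: shorter clips are zero-padded, longer ones are truncated. A minimal sketch to see this for yourself (the checkpoint and dummy audio below are just placeholders):

import numpy as np
from transformers import WhisperFeatureExtractor

# Any Whisper checkpoint works here; "openai/whisper-small" is just an example.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# 10 s of (dummy) audio at 16 kHz: shorter than 30 s, so it gets zero-padded.
audio = np.zeros(16_000 * 10, dtype=np.float32)
features = feature_extractor(audio, sampling_rate=16_000, return_tensors="np")

# The log-mel features always correspond to 30 s of audio, regardless of input length;
# anything longer than 30 s is truncated.
print(features.input_features.shape)  # e.g. (1, 80, 3000)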
Note though that this requirement shouldn’t stop you from using the Whisper model for transcribing audio samples longer than 30s at inference time! See this Colab for an explanation: Google Colab
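For reference, at inference time the 🤗 Transformers ASR pipeline can handle the chunking for you: it splits long audio into 30s windows with striding and stitches the transcriptions back together. A minimal sketch, where "long_audio.mp3" is a placeholder path:

from transformers import pipeline

# Chunked long-form transcription of audio longer than 30 s.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
)

result = asr("long_audio.mp3")
print(result["text"])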
@sanchit-gandhi Do you know if there are any existing pre-processing solutions for chunking long audio + transcript pairs down to <30 sec? I have started on the project below, but if a solution already exists somewhere it would be a huge help:
I am trying to pre-process a dataset for fine-tuning Whisper. The dataset contains an MP3 narration and a ground-truth transcript. However, the recordings are much longer than the 30 sec limit for fine-tuning Whisper. The dataset I have contains many >45 min recordings with ground-truth transcriptions that are NOT timestamped. My current approach is:
- Split audio into chunks <30 sec at silence points
- Run ASR on each chunk using off-the-shelf whisper-large-v3
- Use text similarity to compare each transcribed chunk (which may contain errors) against the full ground-truth transcription.
- Where the similarity match overlaps, copy the corresponding text out of the ground-truth transcription and save it as the ground-truth transcription for that chunk.
I have completed steps 1 and 2 and am currently struggling with step 3. Also, I tried off-the-shelf whisper-large-v3 transcription with timestamps on a full recording, but the timestamps are not correct: they do not align with the audio, so I cannot use them as an audio chunking method. I'm currently experimenting with Mistral-7B-Instruct prompts for the similarity step and for creating the corrected chunks.
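In case it helps, here is a rough sketch of step 3 using only difflib from the standard library instead of an LLM: for each chunk transcription, slide a window over the ground-truth words and keep the best-matching span. The function and margin below are purely illustrative, not a tested solution:

import difflib

def best_matching_span(chunk_text: str, ground_truth: str, margin: int = 20) -> str:
    """Find the span of ground-truth words that best matches a (possibly noisy)
    chunk transcription, by sliding a window of similar length over the reference."""
    chunk_words = chunk_text.split()
    gt_words = ground_truth.split()
    window = len(chunk_words)

    best_score, best_span = 0.0, ""
    # Try windows a bit shorter and longer than the chunk to absorb ASR errors.
    for size in range(max(1, window - margin), window + margin + 1):
        for start in range(0, max(1, len(gt_words) - size + 1)):
            candidate = " ".join(gt_words[start:start + size])
            score = difflib.SequenceMatcher(None, chunk_text.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_score, best_span = score, candidate
    return best_span

This brute-force search is slow on a 45 min transcript; in practice you would restrict the search to the region just after where the previous chunk matched, since the chunks arrive in order.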
If your ground truth is in SRT format, try using the following code:
def group_subtitles_by_duration(subtitles: list, target_duration_ms: int = 30000) -> list:
    """
    Group SRT subtitles sequentially so that the total duration of each group
    (from the start of the first subtitle to the end of the last subtitle)
    does not exceed target_duration_ms.
    (The audio and text remain well-aligned since we split at SRT boundaries.)
    """
    chunks = []
    current_chunk = []
    current_start = None
    current_end = None
    for sub in subtitles:
        if not current_chunk:
            # Start a new group with this subtitle
            current_chunk.append(sub)
            current_start = sub['start']
            current_end = sub['end']
        else:
            # Include the new subtitle if the total duration stays within the target limit
            if sub['end'] - current_start <= target_duration_ms:
                current_chunk.append(sub)
                current_end = sub['end']
            else:
                # Close the current group and start a new one with this subtitle
                chunks.append({'start': current_start, 'end': current_end, 'subs': current_chunk})
                current_chunk = [sub]
                current_start = sub['start']
                current_end = sub['end']
    # Don't forget the final, partially filled group
    if current_chunk:
        chunks.append({'start': current_start, 'end': current_end, 'subs': current_chunk})
    return chunks
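The function expects each subtitle as a dict with 'start' and 'end' in milliseconds (plus the text), and returns chunk boundaries you can use to cut the audio. A rough usage sketch, assuming the srt and pydub packages and placeholder file names:

import srt                      # pip install srt
from pydub import AudioSegment  # pip install pydub (needs ffmpeg)

# "narration.srt" and "narration.mp3" are placeholder file names.
with open("narration.srt", encoding="utf-8") as f:
    parsed = list(srt.parse(f.read()))

# Convert to the dict format expected by group_subtitles_by_duration (times in ms).
subtitles = [
    {
        "start": int(sub.start.total_seconds() * 1000),
        "end": int(sub.end.total_seconds() * 1000),
        "text": sub.content,
    }
    for sub in parsed
]

chunks = group_subtitles_by_duration(subtitles, target_duration_ms=30_000)

# Cut the audio at the grouped boundaries and pair each clip with its text.
audio = AudioSegment.from_file("narration.mp3")
for i, chunk in enumerate(chunks):
    clip = audio[chunk["start"]:chunk["end"]]
    clip.export(f"chunk_{i:04d}.wav", format="wav")
    text = " ".join(sub["text"] for sub in chunk["subs"])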
However, if you only have plain text, you can follow a similar approach to what you’ve already done:
- Use Whisper-large-v3 to transcribe the audio and generate an SRT file (you can use Whisper-WebUI for this).
- Compare the generated SRT transcription with the ground truth text to adjust and correct the SRT accordingly.
- Apply the code above to utilize the SRT timestamps.
I found that this method works relatively well for alignment.
Thanks for sharing. It helps me a lot.