@sanchit-gandhi, thanks for your quick reply. I have resolved that issue, but I want to ask how to fine-tune Whisper on audio clips that are longer than 30 seconds. Is this handled internally during training, or do we need to keep the audio within the 30-second limit? Thanks.
Hey @Deveshp! Unfortunately, the Whisper model forces the audio samples to be padded/truncated to 30s: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
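Concretely, the feature extractor always returns a fixed-size log-mel spectrogram covering 30s of audio: shorter clips are zero-padded, longer ones are truncated. A minimal sketch to see this for yourself (the checkpoint and dummy audio below are just placeholders):

import numpy as np
from transformers import WhisperFeatureExtractor

# Any Whisper checkpoint works here; "openai/whisper-small" is just an example.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# 10 s of (dummy) audio at 16 kHz: shorter than 30 s, so it gets zero-padded.
audio = np.zeros(16_000 * 10, dtype=np.float32)
features = feature_extractor(audio, sampling_rate=16_000, return_tensors="np")

# The log-mel features always correspond to 30 s of audio, regardless of input length;
# anything longer than 30 s is truncated.
print(features.input_features.shape)  # e.g. (1, 80, 3000)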
Note though that this requirement shouldn’t stop you from using the Whisper model for transcribing audio samples longer than 30s at inference time! See this Colab for an explanation: Google Colab
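For reference, at inference time the 🤗 Transformers ASR pipeline can handle the chunking for you: it splits long audio into 30s windows with striding and stitches the transcriptions back together. A minimal sketch, where "long_audio.mp3" is a placeholder path:

from transformers import pipeline

# Chunked long-form transcription of audio longer than 30 s.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
)

result = asr("long_audio.mp3")
print(result["text"])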
@sanchit-gandhi Do you know if there are any existing pre-processing solutions for chunking long audio + transcript pairs down to <30 sec? I have started on the project below, but if a solution already exists somewhere it would be a huge help:
I am trying to pre-process a dataset for fine-tuning Whisper. The dataset contains an MP3 narration and a ground-truth transcript. However, the recordings are much longer than the 30 sec limit for fine-tuning Whisper. The dataset I have contains many >45 min recordings with ground-truth transcriptions that are NOT timestamped. My current approach is:
- Split audio into chunks <30 sec at silence points
- Run ASR on each chunk using off-the-shelf whisper-large-v3
- Use text similarity to compare each transcribed chunk (which may contain errors) against the full ground-truth transcription.
- Where the similarity match overlaps, copy the corresponding text out of the ground-truth transcription and save it as the ground-truth transcription for that chunk.
I have completed steps 1 and 2 and am currently struggling with step 3. Also, I tried off-the-shelf whisper-large-v3 transcription with timestamps on a full recording, but the timestamps are not correct: they do not align with the audio, so I cannot use them as an audio chunking method. I'm currently experimenting with Mistral-7B-Instruct prompts for the similarity step and for creating the corrected chunks.
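In case it helps, here is a rough sketch of step 3 using only difflib from the standard library instead of an LLM: for each chunk transcription, slide a window over the ground-truth words and keep the best-matching span. The function and margin below are purely illustrative, not a tested solution:

import difflib

def best_matching_span(chunk_text: str, ground_truth: str, margin: int = 20) -> str:
    """Find the span of ground-truth words that best matches a (possibly noisy)
    chunk transcription, by sliding a window of similar length over the reference."""
    chunk_words = chunk_text.split()
    gt_words = ground_truth.split()
    window = len(chunk_words)

    best_score, best_span = 0.0, ""
    # Try windows a bit shorter and longer than the chunk to absorb ASR errors.
    for size in range(max(1, window - margin), window + margin + 1):
        for start in range(0, max(1, len(gt_words) - size + 1)):
            candidate = " ".join(gt_words[start:start + size])
            score = difflib.SequenceMatcher(None, chunk_text.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_score, best_span = score, candidate
    return best_span

This brute-force search is slow on a 45 min transcript; in practice you would restrict the search to the region just after where the previous chunk matched, since the chunks arrive in order.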
If your ground truth is in SRT format, try using the following code:
def group_subtitles_by_duration(subtitles: list, target_duration_ms: int = 30000) -> list:
    """
    Group SRT subtitles sequentially so that the total duration of each group
    (from the start of the first subtitle to the end of the last subtitle)
    does not exceed target_duration_ms.
    (The audio and text remain well-aligned since we split at SRT boundaries.)
    """
    chunks = []
    current_chunk = []
    current_start = None
    current_end = None
    for sub in subtitles:
        if not current_chunk:
            # Start a new group with this subtitle
            current_chunk.append(sub)
            current_start = sub['start']
            current_end = sub['end']
        else:
            # Include the new subtitle if the total duration stays within the target limit
            if sub['end'] - current_start <= target_duration_ms:
                current_chunk.append(sub)
                current_end = sub['end']
            else:
                # Close the current group and start a new one with this subtitle
                chunks.append({'start': current_start, 'end': current_end, 'subs': current_chunk})
                current_chunk = [sub]
                current_start = sub['start']
                current_end = sub['end']
    # Don't forget the final, partially filled group
    if current_chunk:
        chunks.append({'start': current_start, 'end': current_end, 'subs': current_chunk})
    return chunks
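The function expects each subtitle as a dict with 'start' and 'end' in milliseconds (plus the text), and returns chunk boundaries you can use to cut the audio. A rough usage sketch, assuming the srt and pydub packages and placeholder file names:

import srt                      # pip install srt
from pydub import AudioSegment  # pip install pydub (needs ffmpeg)

# "narration.srt" and "narration.mp3" are placeholder file names.
with open("narration.srt", encoding="utf-8") as f:
    parsed = list(srt.parse(f.read()))

# Convert to the dict format expected by group_subtitles_by_duration (times in ms).
subtitles = [
    {
        "start": int(sub.start.total_seconds() * 1000),
        "end": int(sub.end.total_seconds() * 1000),
        "text": sub.content,
    }
    for sub in parsed
]

chunks = group_subtitles_by_duration(subtitles, target_duration_ms=30_000)

# Cut the audio at the grouped boundaries and pair each clip with its text.
audio = AudioSegment.from_file("narration.mp3")
for i, chunk in enumerate(chunks):
    clip = audio[chunk["start"]:chunk["end"]]
    clip.export(f"chunk_{i:04d}.wav", format="wav")
    text = " ".join(sub["text"] for sub in chunk["subs"])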
However, if you only have plain text, you can follow a similar approach to what you’ve already done:
- Use Whisper-large-v3 to transcribe the audio and generate an SRT file (you can use Whisper-WebUI for this).
- Compare the generated SRT transcription with the ground truth text to adjust and correct the SRT accordingly.
- Apply the code above to utilize the SRT timestamps.
I found that this method works relatively well for alignment.
Thanks for sharing. It helps me a lot.