Preparing audio and transcripts for fine-tuning Whisper

efaith2000 · July 21, 2025, 6:38pm

I have a set of audio files and their accompanying transcripts (stored as text files), totaling about 13 hours of audio. I plan to use these to fine-tune Whisper, which requires breaking them up into chunks of 30 seconds or less. I know pydub’s split_on_silence can split the audio files at periods of silence so that words are not cut off, but I am unsure how to split the transcripts so that they line up with the audio splits. Are there any existing tools to help with this, or would it need to be done manually?

John6666 · July 22, 2025, 1:15pm

Hmm… How about simply using the chunking setting built into Whisper?

efaith2000 · July 22, 2025, 2:55pm

It seems like those are for producing timestamps for each word generated during inference. If I already have audio and transcripts, I assume that would require using Whisper to re-transcribe each audio file, no?

John6666 · July 22, 2025, 11:03pm

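Oh… True…
It’s really old, but how about aeneas? It is a forced aligner, so it matches an existing transcript to the audio instead of re-transcribing it.
There may also be some useful forced-alignment utilities in the torchaudio library.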
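
For what it's worth, once an aligner has given you start and end times for each word (or sentence), grouping them into clips of 30 seconds or less is mostly bookkeeping. Here is a minimal sketch of that step, assuming the alignment is already available as a list of timed words. `Word` and `chunk_words` are hypothetical names made up for this example (they are not part of aeneas or torchaudio), and the timings are hand-written for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One aligned word; start and end are in seconds (assumed aligner output)."""
    text: str
    start: float
    end: float


def chunk_words(words: List[Word], max_len: float = 30.0) -> List[Tuple[float, float, str]]:
    """Greedily group aligned words into chunks spanning at most max_len seconds.

    Returns (chunk_start, chunk_end, chunk_text) tuples. A single word longer
    than max_len still ends up in a chunk of its own.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_len:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return [(c[0].start, c[-1].end, " ".join(w.text for w in c)) for c in chunks]


# Hand-written example timings, not real aligner output.
sample = [
    Word("hello", 0.0, 0.5),
    Word("world", 0.6, 1.0),
    Word("again", 29.5, 30.5),
    Word("end", 31.0, 31.4),
]
segments = chunk_words(sample, max_len=30.0)
```

Each tuple gives the time range to cut from the audio (for example with pydub slicing, which takes milliseconds) together with the transcript text for that range. This greedy version cuts wherever the limit is reached; cutting at the longest pause inside the window would probably give cleaner clips.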
