Preparing audio and transcripts for fine-tuning Whisper

I have a set of audio files and their accompanying transcripts (stored as text files), totaling about 13 hours of audio. I plan to use these to fine-tune Whisper, which requires breaking them up into chunks of 30 seconds or less. I know pydub’s split_on_silence can split the audio files at periods of silence so that words are not cut off, but I am unsure how to split the transcripts so that they line up with the audio splits. Are there any existing tools to help with this, or would it need to be done manually?


Hmm… How about simply using the chunking settings built into Whisper?

It seems like those are for producing timestamps for each word generated during inference. If I already have audio and transcripts, I assume that would require using Whisper to re-transcribe each audio file, no?


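Oh… True…
It’s really old, but how about aeneas? It is a forced-alignment tool: given an audio file and its transcript, it estimates when each text fragment is spoken, without re-transcribing anything.
The torchaudio library may also have some useful utilities for this.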
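
Once you have per-fragment timestamps from an aligner, merging them into Whisper-sized chunks is fairly mechanical. Here is a rough sketch, assuming the aligner output is a JSON sync map shaped like the one aeneas writes (a "fragments" list whose items have "begin", "end", and "lines"). The function names `load_fragments` and `group_fragments` are made up for this example, and the sample data is invented.

```python
import json

MAX_CHUNK_SECONDS = 30.0  # upper bound on chunk length for Whisper


def load_fragments(sync_map_json):
    """Parse a JSON sync map into (begin, end, text) tuples.

    Assumes each fragment has "begin" and "end" (seconds, as strings)
    and "lines" (a list of text lines), as in an aeneas JSON sync map.
    """
    data = json.loads(sync_map_json)
    return [
        (float(f["begin"]), float(f["end"]), " ".join(f["lines"]))
        for f in data["fragments"]
    ]


def group_fragments(fragments, max_seconds=MAX_CHUNK_SECONDS):
    """Greedily merge consecutive fragments into chunks of at most max_seconds.

    A single fragment longer than max_seconds is kept as its own
    (oversized) chunk and would need to be split some other way.
    """
    chunks = []
    current = None  # [chunk_start, chunk_end, [fragment texts]]
    for begin, end, text in fragments:
        if current is not None and end - current[0] <= max_seconds:
            # Fragment still fits: extend the current chunk.
            current[1] = end
            current[2].append(text)
        else:
            # Close the current chunk (if any) and start a new one.
            if current is not None:
                chunks.append((current[0], current[1], " ".join(current[2])))
            current = [begin, end, [text]]
    if current is not None:
        chunks.append((current[0], current[1], " ".join(current[2])))
    return chunks


# Invented sample data in the assumed sync-map shape.
example = json.dumps({"fragments": [
    {"begin": "0.000", "end": "12.500", "lines": ["First sentence."]},
    {"begin": "12.500", "end": "27.000", "lines": ["Second sentence."]},
    {"begin": "27.000", "end": "41.000", "lines": ["Third sentence."]},
]})

chunks = group_fragments(load_fragments(example))
for start, end, text in chunks:
    print(f"{start:.2f}-{end:.2f}: {text}")
```

Each (start, end) pair can then be used to cut the audio, for example by slicing a pydub AudioSegment in milliseconds, so every clip comes with its matching transcript text. Because the chunk boundaries fall on fragment boundaries, how cleanly the cuts land depends on how the transcript was split into fragments before alignment (one sentence per line is a reasonable starting point).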