I have a set of audio files and their accompanying transcripts (stored as text files), totaling about 13 hours of audio. I plan to use these to fine-tune Whisper, which requires breaking them up into chunks of 30 seconds or less. I know pydub’s split_on_silence can split the audio files at periods of silence so that words are not cut off, but I am unsure how to split the transcripts so that they line up with the audio splits. Are there any existing tools to help with this, or would it need to be done manually?
Hmm… How about simply using the chunking setting built into Whisper?
It seems like those are for producing timestamps for each word generated during inference. If I already have audio and transcripts, I assume that would require using Whisper to re-transcribe each audio file, no?
Oh… True…
It’s really old, but how about aeneas?
There may be some useful utilities in the torchaudio library.
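To make the alignment idea a bit more concrete: if a forced aligner (aeneas, or something from torchaudio) gives you a start time, end time, and text for each sentence or fragment, the remaining step is just grouping consecutive fragments into chunks of 30 seconds or less. Below is a minimal sketch of that grouping step only. The `Fragment` class, the `group_fragments` helper, and the sample timings are all hypothetical and made up for illustration; they are not part of aeneas, torchaudio, or Whisper, and you would need to convert your aligner's actual output into this shape yourself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """One aligned piece of transcript (hypothetical container, not a library type)."""
    start: float  # seconds from the beginning of the audio file
    end: float    # seconds from the beginning of the audio file
    text: str


def merge(frags: List[Fragment]) -> Fragment:
    """Combine consecutive fragments into a single chunk spanning all of them."""
    return Fragment(frags[0].start, frags[-1].end, " ".join(f.text for f in frags))


def group_fragments(fragments: List[Fragment], max_len: float = 30.0) -> List[Fragment]:
    """Greedily pack consecutive fragments into chunks of at most max_len seconds."""
    chunks: List[Fragment] = []
    current: List[Fragment] = []
    for frag in fragments:
        if frag.end - frag.start > max_len:
            # A single fragment that is too long needs finer alignment (e.g. word level).
            raise ValueError(f"fragment longer than {max_len}s: {frag.text!r}")
        # Close the current chunk if adding this fragment would exceed the limit.
        if current and frag.end - current[0].start > max_len:
            chunks.append(merge(current))
            current = []
        current.append(frag)
    if current:
        chunks.append(merge(current))
    return chunks


# Made-up timings standing in for real aligner output.
fragments = [
    Fragment(0.0, 12.0, "first sentence."),
    Fragment(12.5, 25.0, "second sentence."),
    Fragment(25.4, 41.0, "third sentence."),
    Fragment(41.2, 50.0, "fourth."),
]
chunks = group_fragments(fragments)
for c in chunks:
    print(f"{c.start:.1f}-{c.end:.1f}: {c.text}")
```

Each resulting chunk carries both the time range and the matching text, so the audio can be cut at the same boundaries (for example by slicing a pydub `AudioSegment` in milliseconds) instead of splitting on silence first and trying to match the text afterwards.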