Fine-tuning Whisper on a custom dataset

I’m trying to fine-tune Whisper large-v2 on a custom dataset of over 7,000 hours of speech. However, the audio files are very long, since they are recordings of news reports, radio broadcasts, conferences, etc. On average, most files are over an hour long.
Is it possible to train on them as-is, or do I have to split them into 30-second segments? And if I have to, please advise me on an efficient way to process that much data quickly.
I have access to a server with 8x A100 GPUs, so memory shouldn’t be a problem

Hi @byoussef, I have the same problem. Did you find a solution or a smart way to solve your problem?

Hi @iam-pia
Sorry to hear you’re facing the same issue. Fortunately for me, the transcription JSON files had utterance-level timestamps, so I just wrote a script to chunk the audio based on those timestamps and extract the corresponding utterances from the transcript.
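In case it helps anyone landing here: a minimal sketch of that kind of chunking script, assuming the JSON gives each utterance a start/end time in seconds plus its text (those field names are my assumption; adapt them to your transcript format). It just converts timestamps to sample indices and slices the audio array; in practice you’d load the audio with something like `soundfile` or `librosa`, and merge adjacent utterances up to Whisper’s 30-second window to waste less padding.

```python
import numpy as np

def chunk_by_utterances(audio, sr, utterances):
    """Slice a long audio array into per-utterance clips.

    `utterances` is a list of dicts with "start"/"end" in seconds and
    "text" (field names are assumptions; adapt to your JSON schema).
    """
    clips = []
    for utt in utterances:
        lo = int(utt["start"] * sr)
        hi = int(utt["end"] * sr)
        clips.append({"audio": audio[lo:hi], "text": utt["text"]})
    return clips

# Toy example: 10 s of silence at 16 kHz standing in for a real recording.
sr = 16000
audio = np.zeros(10 * sr, dtype=np.float32)
utterances = [
    {"start": 0.5, "end": 3.0, "text": "first utterance"},
    {"start": 3.2, "end": 9.8, "text": "second utterance"},
]

clips = chunk_by_utterances(audio, sr, utterances)
print(len(clips))                  # 2 clips
print(len(clips[0]["audio"]))      # 2.5 s at 16 kHz -> 40000 samples
```

For 7,000 hours of data you’d want to parallelize this across files, e.g. with `datasets.map(..., num_proc=N)` or plain `multiprocessing`, since the slicing itself is trivially per-file independent.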

Ah, that’s lucky! Thanks for the reply :slight_smile: