Fine-tuning Whisper on a custom dataset

I’m trying to fine-tune Whisper large-v2 on a custom dataset of over 7,000 hours of speech. However, the audio files are very long, since they are recordings of news reports, radio broadcasts, conferences, etc. On average, most files are over an hour long.
Is it possible to train on them as-is, or do I have to split them into 30-second segments? And if I have to, please advise me on an efficient way to process that much data quickly.
I have access to a server with 8x A100 GPUs, so memory shouldn’t be a problem

Hi @byoussef, I have the same problem. Did you find a solution or a smart way to solve your problem?

Hi @iam-pia
Sorry to hear you’re facing the same issue. Fortunately for me, the transcription JSON files had utterance-level timestamps, so I just wrote a script to chunk the audio based on those timestamps and extract the corresponding utterances from the transcript.
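In case it helps anyone landing here: a minimal sketch of that kind of chunking script, assuming the JSON gives each utterance a start/end time in seconds plus its text (those field names are my assumption; adapt them to your transcript format). It just converts timestamps to sample indices and slices the audio array; in practice you’d load the audio with something like `soundfile` or `librosa`, and merge adjacent utterances up to Whisper’s 30-second window to waste less padding.

```python
import numpy as np

def chunk_by_utterances(audio, sr, utterances):
    """Slice a long audio array into per-utterance clips.

    `utterances` is a list of dicts with "start"/"end" in seconds and
    "text" (field names are assumptions; adapt to your JSON schema).
    """
    clips = []
    for utt in utterances:
        lo = int(utt["start"] * sr)
        hi = int(utt["end"] * sr)
        clips.append({"audio": audio[lo:hi], "text": utt["text"]})
    return clips

# Toy example: 10 s of silence at 16 kHz standing in for a real recording.
sr = 16000
audio = np.zeros(10 * sr, dtype=np.float32)
utterances = [
    {"start": 0.5, "end": 3.0, "text": "first utterance"},
    {"start": 3.2, "end": 9.8, "text": "second utterance"},
]

clips = chunk_by_utterances(audio, sr, utterances)
print(len(clips))                  # 2 clips
print(len(clips[0]["audio"]))      # 2.5 s at 16 kHz -> 40000 samples
```

For 7,000 hours of data you’d want to parallelize this across files, e.g. with `datasets.map(..., num_proc=N)` or plain `multiprocessing`, since the slicing itself is trivially per-file independent.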

Ah, that’s lucky! Thanks for the reply :slight_smile: