I am interested in using Whisper to translate some Farsi video/audio for my wife. The first pass with large-v3 was not as good as we had hoped, which led me to HF to learn how to fine-tune Whisper.
I did a quick run-through of some of the common_voice_16_0 transcripts, and my wife immediately pointed out that many of the segments in the original “transcript_fa_train.tsv” file had spelling errors and that some of the segments didn’t exactly match the audio. For example: ‘cannot’ ≠ ‘can not’ or ‘gooood byeeee’ ≠ ‘good bye’.
This has made me leery of spending time (and money) fine-tuning Whisper models on Common Voice Farsi if the input files are garbage.
Are errors in the training transcripts a real problem when fine-tuning Whisper? I had assumed the transcripts provided with common_voice were to be treated as 100% correct.
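
To get a rough idea of how widespread the elongated-spelling cases (like ‘gooood byeeee’) are before spending money on training, I could scan the TSV first. This is only a minimal sketch: the `sentence` column name is my assumption about the Common Voice TSV layout, `flag_suspect_rows` is a helper name I made up, and the "same letter three or more times in a row" rule is a crude heuristic that will miss ordinary misspellings and audio mismatches entirely.

```python
import csv
import io
import re

# Heuristic: the same letter repeated three or more times in a row, e.g. "gooood".
# [^\W\d_] matches Unicode letters only, so digits and punctuation are ignored.
ELONGATED = re.compile(r"([^\W\d_])\1{2,}")

def flag_suspect_rows(tsv_file, text_column="sentence"):
    """Yield (line_number, text) for transcripts containing an elongated letter run.

    text_column is an assumption; change it to match the actual TSV header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        text = row.get(text_column) or ""
        if ELONGATED.search(text):
            yield line_number, text

# Tiny in-memory sample standing in for the real file.
sample = io.StringIO("path\tsentence\na.mp3\tgood bye\nb.mp3\tgooood byeeee\n")
suspects = list(flag_suspect_rows(sample))
print(suspects)
```

For the real file, the same function could be called on `open("transcript_fa_train.tsv", encoding="utf-8", newline="")`. Anything it flags would still need a Farsi speaker to review, since a repeated letter is not necessarily an error.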