I’m trying to fine-tune Whisper on some specific terminology, and I’m wondering whether the training clips can be shorter than 30 seconds. I remember reading somewhere that they need to be exactly 30 seconds, but I can’t find that information anymore.
Thank you
(Also, I’m using a metadata CSV with the transcription and the path to each audio file, following the “Create an audio dataset” guide.)
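
For reference, here is a minimal sketch of the metadata layout I mean. It assumes the AudioFolder-style convention of a `metadata.csv` with a `file_name` column next to the audio files. The clip names, the example sentences, the `transcription` column name, and the `write_metadata` / `read_metadata` helpers are placeholders for illustration, not my real data.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(rows, folder):
    """Write metadata.csv with a file_name column and a transcription column."""
    path = Path(folder) / "metadata.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_metadata(path):
    """Read metadata.csv back into a list of dicts, one per clip."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical clips: the paths are relative to the folder holding metadata.csv.
example_rows = [
    {"file_name": "clip_0001.wav", "transcription": "Administer the bolus dose."},
    {"file_name": "clip_0002.wav", "transcription": "Check the titration schedule."},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_metadata(example_rows, tmp)
    loaded = read_metadata(csv_path)

for row in loaded:
    print(row["file_name"], "|", row["transcription"])
```

As far as I understand the guide, a folder laid out like this can then be loaded with `load_dataset("audiofolder", data_dir=...)`, but I have left that out of the sketch so it runs without the `datasets` library or any audio files.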