We have several data sources for Whisper fine-tuning, so we have two options:
Merge: Convert/merge the datasets and fine-tune on them
Chain: Fine-tune on dataset DS1, then from the best checkpoint fine-tune on DS2, etc.
I’m thinking about issues like dataset cleanliness, differences in audio duration / possible chunking, keeping the machine on for many days because the merged dataset is large, etc.
What is the proper/suggested method for fine-tuning transformers models?
Hello, did you ever find a good solution to the options you outlined? Would love to learn more.
Unfortunately, I have not had time to try them yet. My current work is about fine-tuning on several languages with Common Voice, using different splitting algorithms, since CV was not used to train Whisper.
It is important to have distinct voices in the train-dev-test splits.
Therefore I decided NOT to merge them, but to use the other datasets as test sets instead.
I’ll go further with NVIDIA models for comparison.
Unfortunately, the dataset cards of all those models are not detailed enough to show which voices/sentences/utterances were used during training. It would make things worse if I reused them in fine-tuning.
If you are sure this is not the case, my opinion is to merge them before fine-tuning, for better results. With chaining, you start each stage from a more “fixated” knowledge state, and the order of the datasets becomes yet another hyperparameter.