Poor Real-Time Performance of Whisper Models Fine-Tuned on Synthetic Data #198

I have custom text data for plant disease names and plant names like this:

uuid, context 
1er1hhaj13, The Rhododendron, a popular ornamental plant, often suffers from Phytophthora ramorum, a challenging disease to manage and pronounce. This pathogen causes Sudden Oak Death, which can lead to extensive damage and mortality in infected plants.

I used text-to-speech APIs to convert this context column into audio WAV files, choosing 10 speakers with mostly American and British accents. This gave me roughly 5k samples for training and 2k samples for testing.
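
For reference, the generation loop looked roughly like the sketch below. gTTS and pydub here are stand-ins for the TTS service and audio conversion I actually used, and the file name `contexts.csv` and the accent list are illustrative:

```python
import pandas as pd
from gtts import gTTS
from pydub import AudioSegment  # requires ffmpeg to be installed

ACCENTS = ["com", "co.uk", "com.au"]  # gTTS "tld" values yield different English accents

df = pd.read_csv("contexts.csv")  # hypothetical file with "uuid" and "context" columns
for row in df.itertuples():
    for i, tld in enumerate(ACCENTS):
        mp3_path = f"{row.uuid}_{i}.mp3"
        wav_path = f"{row.uuid}_{i}.wav"
        # Synthesize the context sentence with one accent variant.
        gTTS(text=row.context, lang="en", tld=tld).save(mp3_path)
        # Whisper expects 16 kHz mono audio, so resample while converting to WAV.
        (AudioSegment.from_mp3(mp3_path)
            .set_frame_rate(16000)
            .set_channels(1)
            .export(wav_path, format="wav"))
```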

I followed the same steps from “Fast Whisper fine-tuning” to fine-tune Whisper large-v2 with PEFT/LoRA (a rough sketch of the adapter setup is included after the loss table below). The training and validation losses look good:

Step | Training Loss | Validation Loss
250 | 0.413000 | 0.102663
500 | 0.109900 | 0.130888
750 | 0.116500 | 0.102719
1000 | 0.092800 | 0.099153
1250 | 0.068800 | 0.075613 
1500 | 0.042500 | 0.085680
1750 | 0.047500 | 0.076951
2000 | 0.027500 | 0.065127
2250 | 0.023700 | 0.061832
2500 | 0.012500 | 0.062658
2750 | 0.011500 | 0.061922
3000 | 0.008500 | 0.061463
3250 | 0.005300 | 0.060227
3500 | 0.003800 | 0.060712
3750 | 0.002700 | 0.060332
4000 | 0.002300 | 0.060496
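
For context, this is a minimal sketch of the adapter setup along the lines of that recipe; the LoRA hyperparameters here are approximate, and the 8-bit loading and Seq2SeqTrainer wiring are omitted:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Base model; the recipe loads it in 8-bit with bitsandbytes, omitted here for brevity.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# LoRA adapters on the attention projections; r/alpha/dropout values are approximate.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable
```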

When I calculated WER on the test data:

  • OpenAI Whisper APIs: 22.03 WER on test data
  • Fine-tuned model: 0.3 WER on test data
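
The WER for the fine-tuned model was computed with a loop along these lines (a sketch: `test_dataset` is a placeholder for the prepared test split, and `model` is the fine-tuned model from above):

```python
import torch
import evaluate
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
wer_metric = evaluate.load("wer")

predictions, references = [], []
model.eval()
for sample in test_dataset:  # placeholder: items with a 16 kHz "audio" array and "context" text
    inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        generated_ids = model.generate(input_features=inputs.input_features.to(model.device))
    predictions.append(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
    references.append(sample["context"])

# Report WER as a percentage.
print(100 * wer_metric.compute(predictions=predictions, references=references))
```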

These numbers look good. However, during real-time testing with Indian English speakers, accuracy on plant names and disease names was not satisfactory. What strategies could we employ to improve accuracy in real-time settings?
Any guidance or suggestions on this matter would be greatly appreciated. Thank you!