I have a dataset of whale calls annotated like "<|0.00|> call3 <|1.20|> <|3.40|> call4 <|3.80|> <|4.40|> call4 <|5.20|>".
There are ~7,000 annotated 15-second clips, covering a vocabulary of about 40 distinct call types.
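For concreteness, each annotation parses into (start, end, call) events; here is a throwaway sketch of that, not my actual pipeline:

```python
# Throwaway parser from one annotation string to (start_sec, end_sec, call) events
import re

ann = "<|0.00|> call3 <|1.20|> <|3.40|> call4 <|3.80|> <|4.40|> call4 <|5.20|>"
matches = re.findall(r"<\|(\d+\.\d+)\|>\s*(call\d+)\s*<\|(\d+\.\d+)\|>", ann)
events = [(float(start), float(end), call) for start, call, end in matches]
print(events)  # [(0.0, 1.2, 'call3'), (3.4, 3.8, 'call4'), (4.4, 5.2, 'call4')]
```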
Following some of the other tutorials, I appended the call tokens to the English tokenizer and fine-tuned the model.
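Roughly, the token setup looks like this (the checkpoint and call-label names below are placeholders for my actual ones):

```python
# Sketch of adding the call tokens to the tokenizer (checkpoint and labels are placeholders)
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="en", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# ~40 call labels, e.g. "call1" ... "call40"
call_tokens = [f"call{i}" for i in range(1, 41)]
tokenizer.add_tokens(call_tokens)

# Grow the decoder embedding/output matrices to cover the new tokens
model.resize_token_embeddings(len(tokenizer))
```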
Across a wide range of hyperparameters, generation usually collapses to a single call. We cannot even overfit the training set.
The WER is at best 86.7%. Guessing the most frequent call for every event produces a WER of 86.2%, so we are doing worse than a trivial majority-class baseline.
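For that comparison, the call sequences are scored as plain text, something like the following (the strings are made up, and jiwer is used here purely for illustration):

```python
# Illustration of the WER comparison against the most-frequent-call baseline
# (strings are made up; jiwer used just for illustration)
import jiwer

references  = ["call3 call4 call4", "call6 call5", "call4"]
predictions = ["call3", "call6", "call4"]                      # model usually emits one call
majority    = ["call4 call4 call4", "call4 call4", "call4"]    # most frequent call per event

print("model WER   :", jiwer.wer(references, predictions))
print("baseline WER:", jiwer.wer(references, majority))
```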
What's baffling is that the Whisper encoder is great at classifying calls (top-1 accuracy ~40% and top-5 accuracy ~80%, well above chance for ~40 classes). The calls also occur in specific orders: for example, one call (say call 5) is used exclusively alongside call 6, while call 6 frequently appears with other calls. So a model that conditions on the whole past should produce better predictions than a history-blind classifier.
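By "classifying" I mean a probe on pooled encoder features, along these lines (the checkpoint, pooling, and probe head are simplified, not my exact setup):

```python
# Sketch of the kind of encoder probe behind the top-1/top-5 numbers
# (checkpoint, pooling, and probe head are simplified, not my exact setup)
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder.eval()

def embed(waveform_16k):
    """Mean-pooled encoder embedding for one 16 kHz clip (1-D numpy array)."""
    feats = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(feats.input_features).last_hidden_state  # (1, 1500, d_model)
    return hidden.mean(dim=1)                                     # (1, d_model)

# A small linear head over the ~40 call types is trained on these embeddings
num_call_types = 40
probe = torch.nn.Linear(encoder.config.d_model, num_call_types)
```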
Does anyone with experience fine-tuning Whisper have a sense of whether this is (1) a data-scale issue, (2) a generation-config issue, or (3) a training-config issue? Again, this does not seem like a hard task: the calls are easily separable by the Whisper encoder, training loss and validation loss are both decreasing, and early stopping on the validation loss is used.
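In case it helps with (2), the generation call looks roughly like this (the values are illustrative, not my exact config):

```python
# Rough shape of the generation call (values are illustrative, not my exact config).
# The default Whisper generation config suppresses many token ids; worth checking that
# none of the newly added call-token ids ended up in that list.
print(model.generation_config.suppress_tokens)

generated = model.generate(
    input_features,          # log-mel features from the WhisperFeatureExtractor
    max_new_tokens=100,      # long enough for several calls plus timestamp tokens
    num_beams=5,
    return_timestamps=True,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=False))
```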