Fine-tuning Whisper on a sound-event-detection dataset

I have a dataset of whale calls annotated with Whisper-style timestamp tokens, e.g. "<|0.00|> call3 <|1.20|> <|3.40|> call4 <|3.80|> <|4.40|> call4 <|5.20|>".
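
Roughly, the target string for each clip is assembled from (start, end, label) annotations like this (a minimal sketch; the tuple format, helper names, and 0.02 s rounding are illustrative, not the exact pipeline):

```python
# Sketch: build a Whisper-style target string from (start_sec, end_sec, label) tuples.
def timestamp_token(seconds: float) -> str:
    # Whisper timestamp tokens are quantized to 0.02 s steps, e.g. <|1.20|>.
    return f"<|{round(seconds / 0.02) * 0.02:.2f}|>"

def build_target(annotations) -> str:
    segments = [
        f"{timestamp_token(start)} {label} {timestamp_token(end)}"
        for start, end, label in annotations
    ]
    return " ".join(segments)

print(build_target([(0.00, 1.20, "call3"), (3.40, 3.80, "call4"), (4.40, 5.20, "call4")]))
# -> <|0.00|> call3 <|1.20|> <|3.40|> call4 <|3.80|> <|4.40|> call4 <|5.20|>
```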

There are ~7000 annotated 15-second clips, drawn from a vocabulary of about 40 distinct call types.
Following some of the other tutorials, I appended the call tokens to the English tokenizer and fine-tuned the model.
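
Concretely, the tokenizer/model setup looked roughly like this (a sketch; the checkpoint name and the exact spelling of the call labels are placeholders):

```python
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

# Checkpoint name is a placeholder; any Whisper size is set up the same way.
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Append the ~40 call labels as new tokens and grow the embedding matrix to match.
call_labels = [f"call{i}" for i in range(1, 41)]
tokenizer.add_tokens(call_labels)
model.resize_token_embeddings(len(tokenizer))
```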

Across a wide range of hyperparameters, generation usually collapses to a single call. We cannot even overfit the training set.

The WER is at best 86.7. Guessing the most frequent call for every event produces a WER of 86.2, so we are doing worse than that trivial baseline.
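
For reference, the majority-call baseline is scored with an off-the-shelf WER metric, along these lines (the `evaluate` library and the toy strings below are just illustrative):

```python
import evaluate  # assumption: Hugging Face `evaluate`; any WER implementation works

wer = evaluate.load("wer")

# Toy illustration of the baseline: predict the most frequent call for every event.
references = ["call3 call4 call4", "call6 call5 call6"]
predictions = ["call4 call4 call4", "call4 call4 call4"]
print(wer.compute(predictions=predictions, references=references))  # ~0.67 on this toy pair
```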

What's baffling is that the Whisper encoder is great at classifying calls (top-1 accuracy ~40%, top-5 accuracy ~80%). The calls also occur in specific orders: for example, one call (say call 5) is used exclusively alongside call 6, while call 6 frequently appears with other calls. So a model that takes the whole past into context should produce better predictions than a history-blind classifier.
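
Those classification numbers come from probing frozen encoder features with a simple classifier. A rough sketch of the feature extraction (the mean-pooling, the checkpoint name, and running the probe on cropped call segments are my assumptions; the downstream classifier isn't shown):

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder.eval()

@torch.no_grad()
def clip_embedding(waveform, sampling_rate=16000):
    # waveform: a cropped call segment (or a whole clip), 16 kHz mono.
    # Log-mel features -> frozen encoder -> mean-pool over time.
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    hidden = encoder(inputs.input_features).last_hidden_state  # (1, 1500, d_model)
    return hidden.mean(dim=1)  # a simple probe classifier is trained on these vectors
```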

Does anyone with experience fine-tuning Whisper have a sense of whether this is 1) a data-scale issue, 2) a generation-config issue, or 3) a training-config issue? Again, this is not a hard task, and the various calls are easily separable by the Whisper encoder. Training loss and validation loss are both decreasing, and early stopping on the validation loss is used.
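
In case the single-call collapse is a decoding-config problem, the generation call looks roughly like this (the specific values are just one of the configurations tried):

```python
# input_features: log-mel features from WhisperFeatureExtractor, shape (batch, 80, 3000).
generated_ids = model.generate(
    input_features,
    max_new_tokens=128,        # room for many call + timestamp tokens
    num_beams=1,
    return_timestamps=True,    # keep <|t|> tokens rather than suppressing them
)
print(tokenizer.decode(generated_ids[0], decode_with_timestamps=True))
```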
