Whisper for Audio Classification

Hello all,

I'm trying to fine-tune a Whisper model with a WhisperForAudioClassification head using Hugging Face transformers.

As a template I'm using the ASR fine-tuning pipeline, as shown here:

What parts of the pipeline need to be changed?
Does the “prepare_dataset” function change, other than the tokenizer input being a class word instead of a sentence?
Do you still need a collator function?
Is the metric going to be accuracy?
Do you still use the Seq2Seq trainer?

I would like to get to a result like this with a custom dataset: sanchit-gandhi/whisper-tiny-ft-keyword-spotting

Thank you!

Colab code:

The error after trainer.train() seems to be that the target batch_size is always 6x bigger than my input batch_size.
Error message:

ValueError                                Traceback (most recent call last)
<ipython-input-63-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

31 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3051     if size_average is not None or reduce is not None:
   3052         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3053     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

ValueError: Expected input batch_size (8) to match target batch_size (48).

I solved this issue.

Fine-tuning Whisper with a classification head works the same as for other audio classification models, despite the seq2seq architecture.

The only differences from fine-tuning, for example, HuBERT are that you don't pass “attention_mask” and you don't limit the feature extractor's max_length: the inputs must be 30 s long.
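For anyone landing here later, a minimal sketch of how those points could look in code. This is not the notebook itself. The label list, the "label_name" column, the "openai/whisper-tiny" checkpoint, the hyperparameters and the build_trainer helper are all assumptions made for illustration. It uses the plain Trainer rather than the Seq2Seq one, an integer class id as the label instead of a tokenized word, and accuracy as the metric. Because every input is padded to the same 30 s shape, the default collator should be enough.

```python
import numpy as np

# Hypothetical label set; replace with the classes of your own dataset.
LABELS = ["yes", "no", "up", "down"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def prepare_dataset(batch, feature_extractor):
    """Turn one example into log-mel input features plus an integer label."""
    audio = batch["audio"]  # assumes a datasets.Audio column named "audio"
    # No max_length and no attention_mask: WhisperFeatureExtractor pads or
    # truncates to 30 s by default, which is what the encoder expects.
    features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    batch["input_features"] = features.input_features[0]
    # A single class id per example, not a tokenized word.
    # "label_name" is a hypothetical column holding the class string.
    batch["labels"] = LABEL2ID[batch["label_name"]]
    return batch


def compute_metrics(eval_pred):
    """Accuracy over argmax predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}


def build_trainer(train_ds, eval_ds, checkpoint="openai/whisper-tiny",
                  output_dir="whisper-cls"):
    """Hypothetical helper wiring everything into a plain Trainer."""
    # Imported lazily so the helpers above can be used without transformers.
    from transformers import (
        Trainer,
        TrainingArguments,
        WhisperFeatureExtractor,
        WhisperForAudioClassification,
    )

    feature_extractor = WhisperFeatureExtractor.from_pretrained(checkpoint)
    model = WhisperForAudioClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        label2id=LABEL2ID,
        id2label=ID2LABEL,
    )

    def encode(ds):
        return ds.map(
            prepare_dataset,
            fn_kwargs={"feature_extractor": feature_extractor},
            remove_columns=ds.column_names,
        )

    # Illustrative hyperparameters only.
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,
        num_train_epochs=3,
    )
    # Plain Trainer (not Seq2SeqTrainer) with the default data collator.
    return Trainer(
        model=model,
        args=args,
        train_dataset=encode(train_ds),
        eval_dataset=encode(eval_ds),
        compute_metrics=compute_metrics,
    )
```

With two datasets.Dataset objects that have "audio" and "label_name" columns, build_trainer(train_ds, eval_ds).train() would start fine-tuning under these assumptions.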

The colab notebook is updated now.
