Whisper for Audio Classification

Hello all,

I'm trying to fine-tune a Whisper model with the WhisperForAudioClassification head using Hugging Face Transformers.

As a template I'm using the ASR fine-tuning pipeline, as shown here:
https://huggingface.co/blog/fine-tune-whisper

What parts of the pipeline need to be changed?
Is the “prepare_dataset” function going to change, other than the tokenizer input being the class word instead of a sentence?
Do you still need a collator function?
Is the metric going to be accuracy?
Do you still use the Seq2SeqTrainer?

I would like to reach a result like this, but with a custom dataset: sanchit-gandhi/whisper-tiny-ft-keyword-spotting

Thank you!


Colab code:
https://colab.research.google.com/drive/1nU6dlYamT32kfLe2t_AytmOPRjaOxOZn?usp=sharing

The error after trainer.train() seems to be that the target batch_size is always 6x larger than my input batch_size.
error message:


ValueError                                Traceback (most recent call last)
<ipython-input-63-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

31 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3051     if size_average is not None or reduce is not None:
   3052         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3053     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
   3054 
   3055 

ValueError: Expected input batch_size (8) to match target batch_size (48).

Hello,
I solved this issue.

Fine-tuning Whisper with a classification head works the same as for other audio classification models, despite the seq2seq architecture.

The only differences from fine-tuning, for example, HuBERT are that you don't use “attention_mask” and you don't limit the feature extractor's max_length - the inputs should be the full 30 seconds.
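For reference, a minimal sketch of the changed parts (the checkpoint, the label names, and the “class” column are just examples, not necessarily what's in the notebook):

```python
from transformers import (
    WhisperFeatureExtractor,
    WhisperForAudioClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "openai/whisper-tiny"          # example checkpoint
labels = ["happy", "sad"]                   # example class names
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for i, l in enumerate(labels)}

feature_extractor = WhisperFeatureExtractor.from_pretrained(checkpoint)
model = WhisperForAudioClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
)

def prepare_dataset(batch):
    audio = batch["audio"]
    # No truncation / max_length here: the extractor pads everything to 30 s,
    # and WhisperForAudioClassification takes no attention_mask.
    inputs = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_features"] = inputs.input_features[0]
    # Labels are plain integer class ids, not tokenized text.
    batch["label"] = label2id[batch["class"]]   # "class" column name is an assumption
    return batch

training_args = TrainingArguments(
    output_dir="whisper-tiny-audio-classification",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=3,
)

# Plain Trainer, not Seq2SeqTrainer; since every example is the same fixed-size
# log-Mel spectrogram, the default data collator is enough.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=dataset["train"], eval_dataset=dataset["test"])
```

With fixed-size inputs and integer labels you don't need a custom collator or the Seq2SeqTrainer, and the metric can simply be accuracy.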

The colab notebook is updated now.


Thank you for sharing.
In some classification tasks (say I want to detect whether an audio clip is “happy” or “sad”), some inner representations, such as representations from intermediate layers of the Whisper encoder, will likely be better for classification than the final model output.
Therefore it would be interesting to train only part of the model, by adding a classification layer after the layer we're interested in.
Have you tried doing that? Do you know if it is doable within the Hugging Face framework?
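For context, here is a rough, untested sketch of what I have in mind, using Transformers plus plain PyTorch (the class name, checkpoint, layer index and mean-pooling are all placeholder assumptions): freeze the Whisper encoder, take the hidden state of an intermediate layer via output_hidden_states, and train only a small classification head on top.

```python
import torch.nn as nn
from transformers import WhisperModel

class WhisperLayerClassifier(nn.Module):  # hypothetical helper, not a Transformers class
    def __init__(self, checkpoint="openai/whisper-tiny", layer=3, num_labels=2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.encoder.requires_grad_(False)   # only the new head gets trained
        self.layer = layer
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features, labels=None):
        # hidden_states = (embedding output, layer 1 output, ..., last layer output)
        out = self.encoder(input_features, output_hidden_states=True)
        hidden = out.hidden_states[self.layer]        # (batch, frames, d_model)
        logits = self.head(hidden.mean(dim=1))        # mean-pool over time
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}
```

Since it returns a loss and logits, a model like this should still drop into the regular Trainer.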