Whisper for Audio Classification

Hello all,

I'm trying to fine-tune a Whisper model with the WhisperForAudioClassification head using Hugging Face Transformers.

As a template I'm using the ASR fine-tuning pipeline, as shown here:
https://huggingface.co/blog/fine-tune-whisper

What parts of the pipeline need to be changed?
Is the “prepare_dataset” function going to change, other than the tokenizer input being the class word instead of a sentence?
Do you still need a collator function?
Is the metric going to be accuracy?
Do you still use the Seq2SeqTrainer?

I would like to reach a result like this, but with a custom dataset: sanchit-gandhi/whisper-tiny-ft-keyword-spotting

Thank you!


Colab code:
https://colab.research.google.com/drive/1nU6dlYamT32kfLe2t_AytmOPRjaOxOZn?usp=sharing

The error after trainer.train() seems to be that the target batch_size is always 6x larger than my input batch_size.
error message:


ValueError                                Traceback (most recent call last)
<ipython-input-63-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

31 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3051     if size_average is not None or reduce is not None:
   3052         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3053     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
   3054 
   3055 

ValueError: Expected input batch_size (8) to match target batch_size (48).

Hello,
I solved this issue.

Fine-tuning Whisper with a classification head works the same as for other audio classification models, despite the seq2seq architecture.

The only differences from fine-tuning, for example, HuBERT are that you don't use “attention_mask” and you don't limit the feature extractor's max_length - the inputs should be the full 30 seconds.
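For reference, a minimal sketch of the changed parts (the checkpoint, the label names, and the “class” column are just examples, not necessarily what's in the notebook):

```python
from transformers import (
    WhisperFeatureExtractor,
    WhisperForAudioClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "openai/whisper-tiny"          # example checkpoint
labels = ["happy", "sad"]                   # example class names
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for i, l in enumerate(labels)}

feature_extractor = WhisperFeatureExtractor.from_pretrained(checkpoint)
model = WhisperForAudioClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
)

def prepare_dataset(batch):
    audio = batch["audio"]
    # No truncation / max_length here: the extractor pads everything to 30 s,
    # and WhisperForAudioClassification takes no attention_mask.
    inputs = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_features"] = inputs.input_features[0]
    # Labels are plain integer class ids, not tokenized text.
    batch["label"] = label2id[batch["class"]]   # "class" column name is an assumption
    return batch

training_args = TrainingArguments(
    output_dir="whisper-tiny-audio-classification",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=3,
)

# Plain Trainer, not Seq2SeqTrainer; since every example is the same fixed-size
# log-Mel spectrogram, the default data collator is enough.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=dataset["train"], eval_dataset=dataset["test"])
```

With fixed-size inputs and integer labels you don't need a custom collator or the Seq2SeqTrainer, and the metric can simply be accuracy.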

The colab notebook is updated now.


Thank you for sharing.
In some classification tasks (say I want to detect whether an audio clip is “happy” or “sad”), some inner representations, such as representations from intermediate layers of the Whisper encoder, will likely be better for classification than the final model output.
Therefore it would be interesting to train only part of the model, by adding a classification layer after the layer we're interested in.
Have you tried doing that? Do you know if it is doable within the Hugging Face framework?
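For context, here is a rough, untested sketch of what I have in mind, using Transformers plus plain PyTorch (the class name, checkpoint, layer index and mean-pooling are all placeholder assumptions): freeze the Whisper encoder, take the hidden state of an intermediate layer via output_hidden_states, and train only a small classification head on top.

```python
import torch.nn as nn
from transformers import WhisperModel

class WhisperLayerClassifier(nn.Module):  # hypothetical helper, not a Transformers class
    def __init__(self, checkpoint="openai/whisper-tiny", layer=3, num_labels=2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.encoder.requires_grad_(False)   # only the new head gets trained
        self.layer = layer
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features, labels=None):
        # hidden_states = (embedding output, layer 1 output, ..., last layer output)
        out = self.encoder(input_features, output_hidden_states=True)
        hidden = out.hidden_states[self.layer]        # (batch, frames, d_model)
        logits = self.head(hidden.mean(dim=1))        # mean-pool over time
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}
```

Since it returns a loss and logits, a model like this should still drop into the regular Trainer.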