I'm trying to fine-tune a Whisper model with a WhisperForAudioClassification head using Hugging Face Transformers.
As a template I'm using the ASR fine-tuning pipeline shown here: https://huggingface.co/blog/fine-tune-whisper
What parts of the pipeline need to be changed?
Is the “prepare_dataset” function going to change, other than the tokenizer input being a class word instead of a sentence?
Do you still need a collator function?
Is the metric going to be accuracy?
Do you still use the Seq2SeqTrainer?
I would like to get a result like this with a custom dataset: sanchit-gandhi/whisper-tiny-ft-keyword-spotting
Fine-tuning Whisper with a classification head works the same as for other classification models, despite the seq2seq architecture.
The only differences from fine-tuning, say, HuBERT are that you don't use an “attention_mask” and you don't limit the feature extractor's max_length: the inputs should be 30 seconds.
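A rough sketch of what that ends up looking like (the dataset id, label column name, and hyperparameters are placeholders I haven't tested, so adapt them to your data):

```python
import evaluate
import numpy as np
from datasets import Audio, load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    WhisperFeatureExtractor,
    WhisperForAudioClassification,
)

checkpoint = "openai/whisper-tiny"
feature_extractor = WhisperFeatureExtractor.from_pretrained(checkpoint)

# Hypothetical dataset with an "audio" column and a ClassLabel column "label".
dataset = load_dataset("your_username/your_dataset")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

labels = dataset["train"].features["label"].names
label2id = {name: i for i, name in enumerate(labels)}
id2label = {i: name for name, i in label2id.items()}

def prepare_dataset(batch):
    audio = batch["audio"]
    # No tokenizer here: the target is an integer class id, not a sentence.
    # The feature extractor pads every clip to 30 s of log-mel features by
    # default, so no max_length is set and no attention_mask is returned.
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    return batch

dataset = dataset.map(prepare_dataset, remove_columns=["audio"])

model = WhisperForAudioClassification.from_pretrained(
    checkpoint, num_labels=len(labels), label2id=label2id, id2label=id2label
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

training_args = TrainingArguments(
    output_dir="whisper-tiny-audio-classification",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=5,
    evaluation_strategy="epoch",
)

# Plain Trainer, not Seq2SeqTrainer: there is no autoregressive decoding here.
# Since every example is padded to the same 30 s length, the default data
# collator is enough and no custom collator is needed.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
```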
Thank you for sharing.
In some classification tasks (say I want to tell whether an audio clip is “happy” or “sad”), some inner representations, such as those from Whisper's encoder layers, will likely be better for classification than the model's output.
It would therefore be interesting to train only a part of the model, by adding a classification layer after the layer we're interested in.
Have you tried doing that? Do you know if it is doable within the Hugging Face framework?
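Something like this sketch is what I have in mind: freeze the encoder and train only a small head on the hidden states of one intermediate layer (the checkpoint, layer index, pooling, and head size are illustrative placeholders, untested):

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class WhisperLayerClassifier(nn.Module):
    def __init__(self, checkpoint="openai/whisper-tiny", layer_idx=2, num_labels=2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).get_encoder()
        self.encoder.requires_grad_(False)  # train only the classification head
        self.layer_idx = layer_idx
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features):
        with torch.no_grad():
            outputs = self.encoder(input_features, output_hidden_states=True)
        # hidden_states[0] is the embedding output, hidden_states[i] the i-th layer
        hidden = outputs.hidden_states[self.layer_idx]
        pooled = hidden.mean(dim=1)  # mean-pool over time frames
        return self.classifier(pooled)

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
model = WhisperLayerClassifier(num_labels=2)  # e.g. "happy" vs "sad"

# dummy 5 s of silence just to show the shapes; replace with real audio
dummy = np.zeros(16000 * 5, dtype=np.float32)
inputs = feature_extractor(dummy, sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_features)
print(logits.shape)  # (1, 2)
```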