Hello all,
I'm trying to fine-tune a Whisper model with a WhisperForAudioClassification head using Hugging Face Transformers.
As a template I'm using the ASR fine-tuning pipeline shown here:
https://huggingface.co/blog/fine-tune-whisper
What parts of the pipeline need to be changed?
Does the “prepare_dataset” function change beyond the tokenizer input being a class label instead of a sentence?
Do you still need a collator function?
Is the metric going to be accuracy?
Do you still use the Seq2SeqTrainer?
I would like to get to a result like this with a custom dataset: sanchit-gandhi/whisper-tiny-ft-keyword-spotting
Thank you !
Colab code:
https://colab.research.google.com/drive/1nU6dlYamT32kfLe2t_AytmOPRjaOxOZn?usp=sharing
The error after trainer.train() seems to be that the target batch_size is always 6x bigger than my input batch_size.
error message:
ValueError Traceback (most recent call last)
<ipython-input-63-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()
31 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3051 if size_average is not None or reduce is not None:
3052 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3053 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
3054
3055
ValueError: Expected input batch_size (8) to match target batch_size (48).
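The mismatch can be reproduced in isolation with plain PyTorch. The shapes below are illustrative: the 6-token label length is an assumption that would explain the 8 → 48 factor, e.g. if each class label were tokenized into a 6-token sequence instead of kept as a single class id.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)           # (batch, num_classes): classifier output
labels = torch.randint(0, 10, (8,))   # one class id per example: what cross_entropy expects
loss = F.cross_entropy(logits, labels)  # works

token_labels = torch.randint(0, 10, (8, 6))  # tokenized labels, 6 tokens each
err = None
try:
    F.cross_entropy(logits, token_labels.view(-1))
except ValueError as e:
    # "Expected input batch_size (8) to match target batch_size (48)."
    err = e
print(err)
```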
Hello,
I solved this issue.
Fine-tuning Whisper with a classification head works the same way as for other classification models, despite the seq2seq architecture.
The only differences from fine-tuning, for example, HuBERT are that you don't pass an “attention_mask” and don't limit the feature extractor's max_length: the inputs must be the full 30 s.
The colab notebook is updated now.