Fine-tuning Whisper for Audio Classification

Zahra99 · June 27, 2023, 1:33pm

Hi, I hope this message finds you well. I am reaching out to seek clarification regarding fine-tuning the Whisper model for audio classification.

I have been exploring the Hugging Face resources and documentation, but I couldn’t find any blogs or example notebooks that address fine-tuning Whisper for audio classification. Most of the available resources focus on Automatic Speech Recognition (ASR) instead. Nonetheless, I experimented and implemented some code to fine-tune Whisper for audio classification. After a considerable amount of time and effort, I got it running without errors. However, given the lack of dedicated resources, I would appreciate confirmation that my approach is correct.

My approach was similar to the process used for fine-tuning other transformer models such as Wav2Vec2 and HuBERT; the only difference is the function that calls the feature extractor. I defined it as follows:

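from transformers import WhisperFeatureExtractor

# The checkpoint name is an assumption; the post does not say which one was used.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

def prepare_dataset(batch):
    # accessing the audio column loads the clip; it is resampled to 16 kHz here
    # only if the column was cast beforehand with Audio(sampling_rate=16000)
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # labels are already integer class ids, so they are left unchanged
    return batch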

I then applied this function to the entire dataset:

encoded_dataset = dataset.map(prepare_dataset, remove_columns="audio", num_proc=4)
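One quick sanity check on the mapped dataset is the shape of each `input_features` entry. The sketch below works out the expected shape by hand, assuming the default Whisper feature extractor settings (16 kHz audio, a 30-second window, a hop length of 160 samples, 80 mel bins). None of these values are stated in the post, and `expected_feature_shape` is a hypothetical helper name.

```python
# Assumed defaults of the Whisper feature extractor (not stated in the post).
SAMPLING_RATE = 16_000   # Hz
CHUNK_LENGTH_S = 30      # seconds; shorter clips are padded, longer ones truncated
HOP_LENGTH = 160         # samples between successive STFT frames
N_MELS = 80              # large-v3 checkpoints use 128 instead


def expected_feature_shape(n_mels=N_MELS):
    """Return (mel bins, frames) for one padded 30-second window."""
    n_samples = SAMPLING_RATE * CHUNK_LENGTH_S
    n_frames = n_samples // HOP_LENGTH
    return (n_mels, n_frames)


print(expected_feature_shape())  # (80, 3000)
```

If an example in `encoded_dataset` has a different shape, the feature extractor settings or the checkpoint probably differ from these assumptions.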

My question is whether this approach is correct and valid.

If it is not, I would appreciate guidance on the appropriate methodology for fine-tuning Whisper for audio classification.

I appreciate any help.
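The post covers only the preprocessing step. For the model side, the following is a minimal sketch and not the author's actual code. It assumes the `WhisperForAudioClassification` class from Transformers, which adds a classification head on top of the Whisper encoder. The checkpoint name, the emotion labels, and the helper names `build_label_maps` and `load_classifier` are all hypothetical.

```python
def build_label_maps(class_names):
    """Map class names to integer ids and back, in a stable sorted order."""
    names = sorted(set(class_names))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_classifier(class_names, checkpoint="openai/whisper-base"):
    """Load the Whisper encoder with a newly initialised classification head."""
    # Imported lazily so the label helper above works without transformers installed.
    from transformers import WhisperForAudioClassification

    label2id, id2label = build_label_maps(class_names)
    return WhisperForAudioClassification.from_pretrained(
        checkpoint,               # assumed checkpoint, not named in the post
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
    )


# Hypothetical emotion labels, for illustration only.
label2id, id2label = build_label_maps(["angry", "happy", "neutral", "sad"])
print(label2id)
```

Calling `load_classifier([...])` downloads the checkpoint, so it is left out of the example run. The returned model takes the `input_features` produced by `prepare_dataset` together with integer labels.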

Zahra99 · July 4, 2023, 6:23am

@sanchit-gandhi
Could you please help me with this issue?

avar · July 9, 2023, 4:11am

Hi,
Could you give more details, or describe the whole fine-tuning process? Thanks.

mirix · October 13, 2023, 1:27pm

I was considering attempting a similar approach.

How do the results compare to those obtained with other models such as wav2vec2?

mdege · October 9, 2024, 12:50pm

I’m a newbie to fine-tuning and would also like to try fine-tuning Whisper on classification tasks. Do you have a GitHub repository with more detailed code, or more information?
Can you use your own datasets?
Thank you!

Zahra99 · October 9, 2024, 8:25pm

Hi @mdege,

You can find the code in my repository here: GitHub link

Regarding the dataset, I used the IEMOCAP dataset for fine-tuning, but you can definitely try it with your own dataset.

thelou1s · November 8, 2024, 3:03am

From a user’s perspective, what is the most accurate model for audio classification?
