Fine-tuning Whisper for Audio Classification

Hi, I hope this message finds you well. I am reaching out to seek clarification regarding fine-tuning the Whisper model for audio classification.

I have been exploring the Hugging Face resources and documentation, but unfortunately, I couldn't find any blogs or example notebooks specifically addressing the fine-tuning of Whisper for audio classification tasks. Most of the available resources focus on Automatic Speech Recognition (ASR) instead. Nonetheless, I experimented and implemented some code to fine-tune Whisper for audio classification. After a considerable amount of time and effort, I managed to make progress without encountering any errors. However, due to the lack of dedicated resources, I would appreciate confirmation that the approach I followed is indeed correct.

What I did for fine-tuning Whisper for audio classification was similar to the process used for fine-tuning other transformer models like Wav2Vec and Hubert, with the only difference being the function for the feature extractor. In my implementation, I defined a function as follows:

def prepare_dataset(batch):
    # load the audio; resampling from 48 kHz to 16 kHz should already have
    # happened via the dataset's Audio feature (Whisper expects 16 kHz input)
    audio = batch["audio"]

    # compute log-Mel input features from the raw audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # the integer class label needs no encoding, so it is kept as-is
    return batch
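For context on what the feature extractor does here: Whisper's feature extractor always pads or truncates the raw audio to a fixed 30-second window at 16 kHz (480,000 samples) before computing the log-Mel spectrogram, which is why every example ends up with the same `input_features` shape. Here is a minimal pure-Python sketch of just that pad/truncate step; the helper name is mine, not part of the transformers API:

```python
def pad_or_truncate(samples, sampling_rate=16000, max_seconds=30):
    """Pad with zeros or truncate a 1-D audio signal to a fixed length,
    mirroring what Whisper's feature extractor does before the log-Mel step.
    (Illustrative helper only, not the library's implementation.)"""
    target_len = sampling_rate * max_seconds
    if len(samples) >= target_len:
        # clip anything longer than the 30-second window
        return samples[:target_len]
    # zero-pad shorter clips up to the window length
    return samples + [0.0] * (target_len - len(samples))

# a 10-second clip is zero-padded up to 30 seconds
clip = [0.1] * (16000 * 10)
padded = pad_or_truncate(clip)
print(len(padded))  # 480000
```

The practical consequence is that very long clips lose everything past 30 seconds, which is worth keeping in mind when choosing this model for classification.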

And after that I applied this function to the entire dataset in the following manner:

encoded_dataset = dataset.map(prepare_dataset, remove_columns="audio", num_proc=4)
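On the modeling side, transformers does ship a dedicated `WhisperForAudioClassification` head (the encoder plus a classifier; the decoder is not used), so after preparing the dataset the setup is much like the Wav2Vec2 case. Below is a minimal sketch, assuming `transformers` and `torch` are installed; the tiny randomly initialised config and `num_labels=5` are only there to keep the example self-contained, and in practice you would load a pretrained checkpoint such as `openai/whisper-small` with `from_pretrained`, setting `num_labels` to your number of classes:

```python
import torch
from transformers import WhisperConfig, WhisperForAudioClassification

# tiny config so the sketch runs without downloading weights; real
# fine-tuning would instead use
#   WhisperForAudioClassification.from_pretrained(
#       "openai/whisper-small", num_labels=num_classes)
config = WhisperConfig(
    num_mel_bins=8,
    d_model=16,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=32,
    decoder_ffn_dim=32,
    max_source_positions=50,  # encoder accepts 2 * 50 = 100 input frames
    num_labels=5,             # hypothetical number of classes
)
model = WhisperForAudioClassification(config)

# input_features: (batch, num_mel_bins, num_frames)
features = torch.randn(1, 8, 100)
logits = model(input_features=features).logits
print(logits.shape)  # torch.Size([1, 5])
```

From there the model can be trained with the standard `Trainer` / `TrainingArguments` loop, as in the Wav2Vec2 and HuBERT classification examples; the main Whisper-specific difference is that the inputs are log-Mel `input_features` rather than raw `input_values`.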

My question is: could you confirm whether this approach is correct and valid?

And if not, I kindly request your guidance or insights on the appropriate methodology for fine-tuning Whisper for audio classification.

I appreciate any help :pray:


Could you please help me with this issue?

Could you give more details, or walk through the whole fine-tuning process? Thanks

I was considering attempting a similar approach.

How do the results compare to those obtained with other models such as wav2vec2?