Hi, I hope this message finds you well. I am reaching out to seek clarification regarding fine-tuning the Whisper model for audio classification.
I have been exploring the Hugging Face resources and documentation, but unfortunately, I couldn't find any blogs or example notebooks that specifically address fine-tuning Whisper for audio classification tasks. Most of the available resources focus on Automatic Speech Recognition (ASR) instead. Nonetheless, I experimented and implemented some code to fine-tune Whisper for audio classification. After a considerable amount of time and effort, I managed to make progress without encountering any errors. However, because of the lack of dedicated resources, I would appreciate confirmation that the approach I followed is indeed correct.
What I did for fine-tuning Whisper for audio classification was similar to the process used for fine-tuning other transformer models like Wav2Vec2 and HuBERT, with the only difference being the preprocessing function for the feature extractor. In my implementation, I defined the function as follows:
def prepare_dataset(batch):
    # decode the audio; the Audio feature resamples it from 48 kHz to the 16 kHz Whisper expects
    audio = batch["audio"]
    # compute log-Mel input features from the raw audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # the label is already an integer class id, so no further encoding is needed
    return batch
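For context, the resampling itself is not done inside prepare_dataset; I cast the audio column beforehand so that datasets decodes each example at 16 kHz, the rate Whisper expects. A minimal sketch (assuming the column is named "audio"):

from datasets import Audio

# decode and resample the audio column to 16 kHz on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))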
After that, I applied this function to the entire dataset as follows:
encoded_dataset = dataset.map(prepare_dataset, remove_columns="audio", num_proc=4)
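For reference, the remainder of my setup follows the same recipe I used for Wav2Vec2 and HuBERT: a pretrained encoder with a classification head, trained with the Trainer. The sketch below is simplified; the checkpoint, hyperparameters, and split names are placeholders, it relies on transformers' WhisperForAudioClassification head, and it assumes the "label" column is a ClassLabel feature:

from transformers import (
    AutoFeatureExtractor,
    WhisperForAudioClassification,
    TrainingArguments,
    Trainer,
)

model_checkpoint = "openai/whisper-base"  # placeholder checkpoint

# same feature extractor used in prepare_dataset above
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)

# class names, assuming "label" is a ClassLabel feature
label_names = encoded_dataset["train"].features["label"].names

# Whisper encoder with a classification head on top of the pooled encoder outputs
model = WhisperForAudioClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_names),
    label2id={name: i for i, name in enumerate(label_names)},
    id2label={i: name for i, name in enumerate(label_names)},
)

training_args = TrainingArguments(
    output_dir="whisper-audio-classification",  # placeholder
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=5,
    evaluation_strategy="epoch",
)

# input_features are fixed-size log-Mel spectrograms (padded/truncated to 30 s by the
# feature extractor), so the default data collator works without custom padding
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
)
trainer.train()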
My question is: could someone confirm whether this approach is correct and valid?
If not, I would kindly appreciate guidance or insights on the appropriate methodology for fine-tuning Whisper for audio classification.
I appreciate any help.