Finetuning Wav2Vec2 vs. Finetuning DistilBERT

https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb#scrollTo=pFSqZ0jwCMSv

I have been following the notebook above, but it seems to be out of date in a few ways. I run into issues right around the point where the author builds a batch of jagged ndarrays in the preprocessing map function.

When I finetuned a DistilBERT model on IMDB reviews, the process was much simpler: I just loaded the model, specified a different number of classes, and finetuned it with a Trainer object.
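For context, this is roughly the text workflow I mean (a minimal sketch; the dataset split sizes and training arguments are just illustrative):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate/pad so every example has the same length
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# The classification head is attached for you; just pass num_labels
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-imdb", num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```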

Can someone explain to me:

  1. Is there an updated way to finetune an audio model for speech emotion analysis?
  2. Why do the processes differ so drastically between the audio and text sentiment models?
  3. Why do we have to attach a completely different classifier head via torch, as is done in the linked notebook? I was able to do the entire finetuning for the text model with transformers alone.

Thanks in advance for the information; anything about how things have changed since that notebook was written really helps (it used Python 3.7, so it was probably written a while ago).

Posting here as I found a better guide that’s a bit more up to date:

https://towardsdatascience.com/fine-tuning-hubert-for-emotion-recognition-in-custom-audio-data-using-huggingface-c2d516b41cd8

It appears that the reason DistilBERT (and the guide above) was easier is that the library now provides classes that attach the appropriate classification head to the model for you. Be careful with old documentation and tutorials: this package updates quite frequently, so older, more hands-on methods become obsolete (and our lives get easier!)
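For anyone landing here later, this is roughly how the same pattern now looks for audio, using AutoFeatureExtractor and AutoModelForAudioClassification from recent transformers releases. It is only a sketch: the dataset path, label count, and clip length are placeholders, and it assumes a datasets-style dataset with "audio" and "label" columns.

```python
from datasets import load_dataset, Audio
from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          TrainingArguments, Trainer)

# Placeholder dataset: must have an "audio" column and an integer "label" column
dataset = load_dataset("path/to/your_emotion_dataset", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess(batch):
    audio_arrays = [a["array"] for a in batch["audio"]]
    # Padding/truncating to a fixed length avoids the jagged-ndarray problem
    # from the older notebook (4-second clips here, purely as an example)
    return feature_extractor(audio_arrays, sampling_rate=16_000,
                             max_length=16_000 * 4, truncation=True,
                             padding="max_length")

dataset = dataset.map(preprocess, batched=True, remove_columns=["audio"])

# As with the text model, the classifier head is attached for you
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wav2vec2-emotion", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```

The same pattern should work with HuBERT checkpoints as well, since the audio classification classes wrap the encoder and pooling/classifier layers for you rather than requiring a hand-written torch head.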