I have been following the documentation above, but it seems to be out of date in a few ways. I run into issues right around the point where the author builds a bunch of jagged ndarrays in the preprocessing map function.
When I fine-tuned a DistilBERT model on IMDB reviews, the process was much simpler: I just loaded the model, specified a different number of classes, and fine-tuned it with a Trainer object.
Can someone explain to me:
- Is there an updated way to fine-tune an audio model for speech emotion analysis?
- Why do the processes so drastically differ between the audio and text sentiment models?
- Why do we have to attach a completely different classifier head, as is done in the linked tutorial, using raw PyTorch? I was able to do the entire fine-tuning for the text model with transformers alone.
Thanks in advance; any information about how things have changed since that documentation was written really helps (it used Python 3.7, so it is probably fairly old).