We are working on a thesis project at Spotify on Podcast Trailer Generation - Hotspot Detection for the Spotify Podcast Dataset.
The Spotify Podcast Dataset contains both transcripts and audio for many podcast episodes, and we are currently looking to use Wav2Vec2 embeddings as input to train an emotion classification model on the audio. The audio is currently English only, with accompanying transcripts.
It would be much appreciated if you could help with fine-tuning Wav2Vec2 on some standard emotion-annotated audio datasets (e.g. RAVDESS, SAVEE). We will then use the fine-tuned embeddings as input for emotion classification, after which we will run a human evaluation of the classification results.
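For context, the "embeddings as input" step we have in mind looks roughly like the sketch below: run audio through a frozen Wav2Vec2 encoder and mean-pool the frame embeddings into one fixed-size vector per clip. The checkpoint name in the comment and the mean-pooling choice are our assumptions, not a prescribed recipe:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

def clip_embedding(model: Wav2Vec2Model, waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool Wav2Vec2 frame embeddings into one fixed-size vector per clip.

    `waveform` is a (batch, samples) float tensor of 16 kHz audio, already
    normalized (e.g. by Wav2Vec2FeatureExtractor).
    """
    with torch.no_grad():
        # (batch, samples) -> (batch, frames, hidden_size)
        hidden = model(waveform).last_hidden_state
    return hidden.mean(dim=1)  # (batch, hidden_size)

# In practice you would load a pretrained checkpoint, e.g.:
#   model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
```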
That sounds great. I’m also working with fine-tuning Wav2Vec2. I can help you out if you have any questions. @patrickvonplaten is also a great person to ask.
Thanks for your post here! I think it would be a good idea to use Wav2Vec2 for emotion classification. I won’t find time to fine-tune the model myself any time soon, but it should be rather straightforward to do so. Here is what needs to be done before Wav2Vec2 can be fine-tuned for emotion classification:
- Add a Wav2Vec2ForSpeechClassification model that would be implemented very similarly to BertForSequenceClassification.
- It would probably be much easier to train the model if the two datasets you linked above were added to datasets. It should be rather straightforward to do this yourself, or you can open a dataset request issue in the library and maybe someone in the open-source community will be interested in tackling it. See these issues for example: Issues · huggingface/datasets · GitHub
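To make the first point concrete, here is a minimal sketch of what such a Wav2Vec2ForSpeechClassification could look like. The mean-pooling over time and the (loss, logits) return signature are assumptions for illustration; a merged implementation may differ:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Config, Wav2Vec2Model

class Wav2Vec2ForSpeechClassification(nn.Module):
    """Wav2Vec2 encoder with a pooled classification head, analogous in
    spirit to BertForSequenceClassification."""

    def __init__(self, config: Wav2Vec2Config, num_labels: int):
        super().__init__()
        self.wav2vec2 = Wav2Vec2Model(config)
        self.dropout = nn.Dropout(config.final_dropout)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_values, labels=None):
        # (batch, samples) raw audio -> (batch, frames, hidden_size)
        hidden = self.wav2vec2(input_values).last_hidden_state
        # Mean-pool over time (an assumption; BERT uses the [CLS] token instead)
        pooled = hidden.mean(dim=1)
        logits = self.classifier(self.dropout(pooled))
        loss = None
        if labels is not None:
            loss = nn.CrossEntropyLoss()(logits, labels)
        return loss, logits
```

In practice you would initialize the encoder from a pretrained checkpoint (e.g. via Wav2Vec2Model.from_pretrained) rather than from a fresh config, so only the classifier head starts untrained.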
Having added those things, it should be rather straightforward to train the model. You can search for “transformers sentiment analysis” online to get a feeling for how it is done. See this article for example: https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/
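The fine-tuning loop itself can be sketched in plain PyTorch. Everything here is an assumption for illustration: a model whose forward returns (loss, logits), a dataset yielding (input_values, label) pairs, and the hyperparameter values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, train_set, epochs=3, lr=1e-5, batch_size=4):
    """Minimal fine-tuning loop for a classifier whose forward pass
    returns (loss, logits). Hyperparameters are placeholder values."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        total = 0.0
        for input_values, labels in loader:
            loss, _ = model(input_values, labels=labels)
            optim.zero_grad()
            loss.backward()
            optim.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")
```

For a real run you would likely add a learning-rate schedule, gradient clipping, and a held-out validation split, or use the transformers Trainer instead of a hand-written loop.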
@Winstead, have you made any progress on this?
I am also working on a similar project and will be happy to collaborate/assist.
Hi @othrif !
I’ve written the basic code and trained on some emotion-annotated speech datasets, but the accuracy has not been good so far. Since I’m also working on several other approaches in the meantime and haven’t spent much time on fine-tuning Wav2Vec2, I believe there is a lot of room for improvement.
And yes, I would be glad to collaborate/discuss on this. How would you prefer to communicate?
@Winstead, it would probably solve your problem.
@m3hrdadfi Awesome! Thank you so much!