We are working on a thesis project on Podcast Trailer Generation - Hotspot Detection for the Podcast Dataset at Spotify.
The Spotify Podcast Dataset contains both transcripts and audio for many podcast episodes, and we currently plan to use Wav2Vec2 embeddings as input to train an emotion classification model on the audio. The audio is currently English-only (with accompanying transcripts).
It would be much appreciated if you could help with fine-tuning Wav2Vec2 on some standard emotion-annotated audio datasets (e.g. RAVDESS, SAVEE). We will then use the fine-tuned embeddings as input for emotion classification, after which we will run human evaluation of the classification results.
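As a starting point, the fine-tuning setup could look roughly like the sketch below, using Hugging Face `transformers`' `Wav2Vec2ForSequenceClassification` with one label per emotion class (RAVDESS annotates 8 emotions). This is only a minimal sketch: in practice you would call `from_pretrained("facebook/wav2vec2-base", num_labels=...)` and train on real labeled clips, whereas here a tiny randomly initialised config and a dummy waveform keep the example self-contained; the specific class index and config sizes are illustrative assumptions, not project decisions.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# RAVDESS annotates 8 emotion classes:
# neutral, calm, happy, sad, angry, fearful, disgust, surprised.
NUM_EMOTIONS = 8

# For the real project you would instead load pretrained weights, e.g.:
#   model = Wav2Vec2ForSequenceClassification.from_pretrained(
#       "facebook/wav2vec2-base", num_labels=NUM_EMOTIONS)
# A tiny random-initialised config keeps this sketch runnable offline.
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_feat_extract_layers=2,
    conv_dim=(32, 32),
    conv_kernel=(10, 3),
    conv_stride=(5, 2),
    num_labels=NUM_EMOTIONS,
)
model = Wav2Vec2ForSequenceClassification(config)

# One second of 16 kHz audio, standing in for a RAVDESS/SAVEE clip.
waveform = torch.randn(1, 16000)
labels = torch.tensor([3])  # hypothetical class index for this clip

# A single supervised forward pass; outputs.loss is the cross-entropy
# loss you would backpropagate during fine-tuning.
outputs = model(input_values=waveform, labels=labels)
print(tuple(outputs.logits.shape))  # → (1, 8)
```

After fine-tuning, the pooled hidden states (or the logits) can serve as the emotion-aware embeddings for the downstream hotspot-detection stage.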