Can Wav2Vec2 distinguish music during speech-to-text?

Hi everyone,

I have a custom dataset in which about 10% of the test set is music. I want those clips to be labeled as “music” while performing speech-to-text. I am using the wav2vec2 model for speech-to-text, and I would like to know whether wav2vec2 can distinguish music from speech during transcription. I tried it, but the loss started out very high and then dropped to zero after a while. If anyone can guide me through this, I would really appreciate it.
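To make the question concrete, here is a minimal sketch of the kind of labeling I have in mind. It is not my exact code: the `<music>` token, the `is_music` flag, and the helper names are all hypothetical, and the vocabulary layout only loosely follows the usual character-level CTC setup (padding token, unknown token, `|` as word delimiter).

```python
# Hypothetical sketch: the "<music>" token, the is_music flag and the helper
# names are assumptions for illustration, not part of any library.
MUSIC_TOKEN = "<music>"


def build_vocab(transcripts):
    """Build a character-level vocabulary plus one dedicated music token."""
    chars = sorted({c for t in transcripts for c in t.lower() if c != " "})
    # "|" stands in for the space between words.
    vocab = {"<pad>": 0, "<unk>": 1, "|": 2, MUSIC_TOKEN: 3}
    for c in chars:
        vocab[c] = len(vocab)
    return vocab


def encode(transcript, is_music, vocab):
    """Map a clip to label ids: a single music id, or one id per character."""
    if is_music:
        return [vocab[MUSIC_TOKEN]]
    text = transcript.lower().replace(" ", "|")
    return [vocab.get(c, vocab["<unk>"]) for c in text]


examples = [
    {"text": "hello world", "is_music": False},
    {"text": "", "is_music": True},  # music clip, no transcript
]

vocab = build_vocab([e["text"] for e in examples if not e["is_music"]])
labels = [encode(e["text"], e["is_music"], vocab) for e in examples]
print(labels)
```

With this setup, every music clip gets the same one-token target, while speech clips keep their normal character targets. I am not sure whether this is a sensible way to train the model, or whether a separate music/speech classifier in front of wav2vec2 would be the better option.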

Thanks!