Can Wav2Vec2 distinguish music during speech-to-text?

Hi everyone,

I have a custom dataset in which music makes up 10% of the test set. I want to label the music data as “music” while performing speech-to-text. I am using the wav2vec2 model for speech-to-text, and I want to know whether it is possible to distinguish music data from speech data during transcription with wav2vec2. I tried it, but the loss value started out very high and then decreased to zero after a while. If anyone can guide me through this, I would appreciate it.
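
To make the idea concrete, here is a rough sketch of one way the targets could be encoded, assuming wav2vec2 is fine-tuned with a character-level CTC vocabulary and that every music clip gets a single dedicated token as its whole label. The names `MUSIC_TOKEN`, `build_vocab`, and `encode_target`, the token strings, and the id layout are all hypothetical and not taken from any library; this is not necessarily how it should be done.

```python
# Hypothetical sketch: a character-level vocabulary with one extra token for music.
# MUSIC_TOKEN, build_vocab and encode_target are illustrative names, not a library API.
MUSIC_TOKEN = "<music>"


def build_vocab(transcripts):
    """Collect the characters seen in the transcripts and add special tokens.

    Assumption: "<pad>" is id 0, "<unk>" covers unseen characters,
    and "|" stands in for the space between words.
    """
    chars = sorted({ch for text in transcripts for ch in text.lower() if ch != " "})
    vocab = {"<pad>": 0, "<unk>": 1, "|": 2, MUSIC_TOKEN: 3}
    for ch in chars:
        # setdefault keeps the ids of the special tokens if a character collides
        vocab.setdefault(ch, len(vocab))
    return vocab


def encode_target(text, is_music, vocab):
    """Return label ids: one music token for music clips, characters otherwise."""
    if is_music:
        return [vocab[MUSIC_TOKEN]]
    normalized = text.lower().replace(" ", "|")
    return [vocab.get(ch, vocab["<unk>"]) for ch in normalized]


if __name__ == "__main__":
    vocab = build_vocab(["hello world", "good morning"])
    print(encode_target("hello world", False, vocab))
    print(encode_target("", True, vocab))
```

With this framing, a music clip has a one-token target while a speech clip has one id per character, so the two kinds of examples look very different to the loss.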


My organisation is in trouble: my whole business is based on STT, and I need more accurate transcription. SeamlessM4T is not able to transcribe any audio fully, and my audio files contain a bit of noise. I run a testing platform for students preparing for IELTS, PTE, and TOEFL. Students answer each question in audio form, and to evaluate their answers I need a fully accurate transcript of that audio so I can analyse their grammar and mistakes. I have used Whisper in the frontend, but the problem is that Whisper auto-corrects the speech and sometimes gets stuck on one word, repeating it again and again. I have tried the Web Speech API as well, but it gets stuck partway through. I have a huge transcription volume, approximately 80,000 hours per month.
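
As a stopgap for the repeated-word problem, a small check like the one below could flag suspicious transcripts for a second pass instead of editing them. This is only a sketch under my own assumptions: `longest_word_run`, `needs_review`, and the `max_run` threshold of 3 are hypothetical names and values, not part of Whisper or any other library.

```python
import string


def longest_word_run(transcript):
    """Return (word, length) for the longest run of one word repeated back to back."""
    words = [w.strip(string.punctuation) for w in transcript.lower().split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    best_word, best_len = None, 0
    run_word, run_len = None, 0
    for w in words:
        if w == run_word:
            run_len += 1
        else:
            run_word, run_len = w, 1
        if run_len > best_len:
            best_word, best_len = run_word, run_len
    return best_word, best_len


def needs_review(transcript, max_run=3):
    """Flag a transcript when any word repeats more than max_run times in a row.

    max_run=3 is an arbitrary starting point, not a tuned value.
    """
    _, run = longest_word_run(transcript)
    return run > max_run


if __name__ == "__main__":
    sample = "I think the the the the answer is correct"
    print(longest_word_run(sample), needs_review(sample))
```

The check only flags transcripts and leaves the text untouched, because a student may genuinely repeat a word and the grammar evaluation needs what was actually said.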