Audio classification vs. transcribing and using a text classifier

Hey :wave:

I’m working on a problem where I have to classify audio data (each file contains just one person speaking). There are several ways of doing this, but two of them are a) fine-tuning an audio classifier to attribute a label to the audio file directly, and b) transcribing the audio file and using a text classifier to attribute the label.

From what I’ve seen, and given my experience with these technologies, it was easier to perform transcription and then fine-tune a BERT model for the job. The fine-tuning was quite fast, inference time is very low, and the results are rather good. I’m still trying to fine-tune the audio classifier (and learning how to do it along the way :yum:).
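For reference, the transcribe-then-classify pipeline looks roughly like this sketch using the :hugs: Transformers `pipeline` API (the Whisper checkpoint and the path to the fine-tuned BERT are assumptions; swap in whatever models you use):

```python
from transformers import pipeline


def transcribe(audio_path: str, asr_model: str = "openai/whisper-small") -> str:
    """Transcribe a single-speaker audio file to text (ASR checkpoint is an assumption)."""
    asr = pipeline("automatic-speech-recognition", model=asr_model)
    return asr(audio_path)["text"]


def classify(text: str, clf_model: str = "path/to/my-finetuned-bert") -> str:
    """Attribute a label to the transcript with a fine-tuned BERT classifier
    (the model path is a placeholder)."""
    clf = pipeline("text-classification", model=clf_model)
    return clf(text)[0]["label"]


# Example usage (requires an actual audio file and model checkpoint):
# label = classify(transcribe("speaker.wav"))
```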

What are the advantages of one strategy over the other? Intuitively, should one of them be expected to yield better results? I’m new to this area, so I’m just trying to learn and make informed decisions.

Any help is appreciated!

Cheers :raised_hands:

(I’ve also posted this question on Stack Overflow – here)