Audio classification vs. transcribing and using a text classifier

Hey :wave:

I’m working on a problem where I have to classify audio data (each file contains just one person speaking). There are several ways of doing this, but two of them are a) fine-tuning an audio classifier to attribute a label to the audio file directly, and b) transcribing the audio file and using a text classifier to attribute the label.

From what I’ve seen, and given my experience with these technologies, it was easier to perform transcription and then fine-tune a BERT model for the job. The fine-tuning was quite fast, inference time is very low, and the results are rather good. I’m still trying to fine-tune the audio classifier (and learning how to do it along the way :yum:).
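For reference, the transcribe-then-classify pipeline looks roughly like this sketch using the :hugs: Transformers `pipeline` API (the Whisper checkpoint and the path to the fine-tuned BERT are assumptions; swap in whatever models you use):

```python
from transformers import pipeline


def transcribe(audio_path: str, asr_model: str = "openai/whisper-small") -> str:
    """Transcribe a single-speaker audio file to text (ASR checkpoint is an assumption)."""
    asr = pipeline("automatic-speech-recognition", model=asr_model)
    return asr(audio_path)["text"]


def classify(text: str, clf_model: str = "path/to/my-finetuned-bert") -> str:
    """Attribute a label to the transcript with a fine-tuned BERT classifier
    (the model path is a placeholder)."""
    clf = pipeline("text-classification", model=clf_model)
    return clf(text)[0]["label"]


# Example usage (requires an actual audio file and model checkpoint):
# label = classify(transcribe("speaker.wav"))
```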

What are the advantages of one strategy over the other? Intuitively, should one of them be expected to yield better results? I’m new to this area, so I’m just trying to learn and make informed decisions.

Any help is appreciated!

Cheers :raised_hands:

(I’ve also posted this question on Stack Overflow – here)