How to classify audio into other/breath/speech with precise timestamps?

I want to predict where "other" (noise, silence, other signals), breaths, or voice activity (including singing) occurs in an audio source.

Should I retrain the MIT/AST pretrained models so they also give me timing data (rough sketch of what I mean below)?
Or would it be better to use pretrained models for speaker diarization?
Or would it make sense to train my own model from scratch?
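
To make the AST option concrete, here is roughly what I have in mind: slide a fixed window over the audio and classify each chunk, so the window positions become the timestamps. This is only a sketch, not something I've validated; the checkpoint name, the 16 kHz resample rate, the file name, and the 1-second window are all assumptions, and the AudioSet labels would still have to be mapped down to my three classes (other/breath/speech).

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# Assumed checkpoint: the AudioSet-finetuned AST model on the Hub
model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).eval()

# Load audio, downmix to mono, resample to the 16 kHz the extractor expects
waveform, sr = torchaudio.load("input.wav")  # hypothetical file name
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

win = 16_000  # 1-second windows -> 1-second timestamp resolution (assumption)
for start in range(0, waveform.shape[0] - win + 1, win):
    chunk = waveform[start:start + win].numpy()
    inputs = extractor(chunk, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[logits.argmax(-1).item()]
    print(f"{start / 16_000:5.1f}s - {(start + win) / 16_000:5.1f}s: {label}")
```

Is that a reasonable direction, or is sliding-window classification too crude to get precise timestamps?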

I'm very new to machine learning, so any help is very much appreciated :slight_smile: