How to classify audio into other/breath/speech with precise timestamps?

I want to predict where "other" (noise, silence, other signals), breaths, or voice activity (including singing) occurs in an audio source.

Should I retrain the MIT/AST pretrained models so they also give me timing data (rough sketch of what I mean below)?
Or would it be better to use pretrained models for speaker diarization?
Or would it make sense to train my own model from scratch?
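
To make the AST option concrete, here is roughly what I have in mind: slide a fixed window over the audio and classify each chunk, so the window positions become the timestamps. This is only a sketch, not something I've validated; the checkpoint name, the 16 kHz resample rate, the file name, and the 1-second window are all assumptions, and the AudioSet labels would still have to be mapped down to my three classes (other/breath/speech).

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# Assumed checkpoint: the AudioSet-finetuned AST model on the Hub
model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).eval()

# Load audio, downmix to mono, resample to the 16 kHz the extractor expects
waveform, sr = torchaudio.load("input.wav")  # hypothetical file name
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

win = 16_000  # 1-second windows -> 1-second timestamp resolution (assumption)
for start in range(0, waveform.shape[0] - win + 1, win):
    chunk = waveform[start:start + win].numpy()
    inputs = extractor(chunk, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[logits.argmax(-1).item()]
    print(f"{start / 16_000:5.1f}s - {(start + win) / 16_000:5.1f}s: {label}")
```

Is that a reasonable direction, or is sliding-window classification too crude to get precise timestamps?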

I'm very new to machine learning, so any help is very much appreciated :slight_smile: