I’m looking into audio diarization with the caveat that the number of speakers is not known beforehand. This rules out models that require the speaker count as an input.
My approach was to run Wav2Vec2ForXVector on each audio snippet and then group the resulting embeddings with agglomerative clustering, using cosine similarity and some value for distance_threshold. Although the results on the training data were good (confusion matrix, etc.), the problem is that they are very sensitive to the threshold value, to the point where the approach doesn’t generalize well at all.
Has anyone attempted this problem and obtained more stable results?