Diarization with unknown number of speakers

Hi there!

I’m looking into speaker diarization for audio where the number of speakers is not known beforehand. This rules out models that need the number of speakers up front, such as Wav2Vec2ForAudioFrameClassification.

My approach was to run Wav2Vec2ForXVector on each audio snippet and then cluster the resulting vectors with agglomerative clustering, using cosine similarity and some value for distance_threshold. The results on the training data were good (confusion matrix, etc.), but they are very sensitive to the threshold value, to the point where the approach doesn’t generalize well at all.

Has anyone attempted this problem and obtained more stable results?