Understanding UniSpeech-SAT diarization output

Hello, I am trying to understand the output of the UniSpeech-SAT diarization model, using the checkpoint microsoft/unispeech-sat-base-plus-sd from Hugging Face. The sample code contains this comment:

# labels is a one-hot array of shape (num_frames, num_speakers)

When I print the labels, the model always outputs num_speakers equal to two, even though there are more than two speakers in my recording. The probabilities also seem quite random and change a lot from frame to frame. (I have estimated the frame duration for this model to be 0.02 s, assuming that the windows do not overlap. Is this assumption correct?) Here is a sample of the output probabilities for my file:

        [3.1691e-01, 1.8716e-03],
        [1.2288e-01, 1.1750e-01],
        [1.8027e-01, 2.5845e-03],
        [4.1805e-02, 5.4180e-03],
        [5.3213e-02, 7.5354e-03],
        [8.5969e-02, 1.3703e-02],
        [5.9220e-02, 2.8392e-01],
        [5.0971e-02, 4.1676e-02],
        [2.2690e-01, 1.3807e-02],
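
To make the frame-duration arithmetic and the label step concrete, here is a minimal sketch. It assumes the feature encoder uses the wav2vec 2.0-style convolution kernels and strides and a 16 kHz input, and that labels are obtained with a 0.5 threshold; none of these values are read from the checkpoint, and the helper name frame_geometry is made up for illustration.

```python
# Assumed wav2vec 2.0-style feature-encoder layout; NOT read from the checkpoint.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed input sampling rate
THRESHOLD = 0.5       # assumed decision threshold for turning probabilities into labels


def frame_geometry(kernels, strides):
    """Return (hop, window) in samples for a stack of 1-D convolutions.

    hop    = distance between the starts of consecutive output frames
    window = number of input samples each output frame sees
    """
    hop, window = 1, 1
    for kernel, stride in zip(kernels, strides):
        window += (kernel - 1) * hop
        hop *= stride
    return hop, window


hop, window = frame_geometry(CONV_KERNELS, CONV_STRIDES)
hop_s = hop / SAMPLE_RATE        # spacing between frames, in seconds
window_s = window / SAMPLE_RATE  # audio covered by one frame, in seconds
print(f"hop = {hop} samples ({hop_s} s), window = {window} samples ({window_s} s)")

# The probabilities printed above, one row per frame, one column per speaker slot.
probs = [
    [3.1691e-01, 1.8716e-03],
    [1.2288e-01, 1.1750e-01],
    [1.8027e-01, 2.5845e-03],
    [4.1805e-02, 5.4180e-03],
    [5.3213e-02, 7.5354e-03],
    [8.5969e-02, 1.3703e-02],
    [5.9220e-02, 2.8392e-01],
    [5.0971e-02, 4.1676e-02],
    [2.2690e-01, 1.3807e-02],
]

# Each column is thresholded independently, so a frame can have 0, 1 or 2 active speakers.
labels = [[int(p > THRESHOLD) for p in row] for row in probs]
active_frames = sum(any(row) for row in labels)
print(f"frames with at least one active speaker: {active_frames} of {len(labels)}")
```

Under these assumptions the frames would be spaced 0.02 s apart while each one covers 0.025 s of audio, so neighbouring windows would overlap slightly. With a 0.5 threshold, none of the nine frames above would mark either speaker as active.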

Any help with understanding the output would be much appreciated.

Thank you,

Jonas
