Understanding UniSpeech-SAT diarization output

jonaskratochvil · January 14, 2022, 2:32pm

Hello, I am trying to understand the output of the UniSpeech-SAT diarization model, using the checkpoint microsoft/unispeech-sat-base-plus-sd on Hugging Face. The sample code contains this comment:

# labels is a one-hot array of shape (num_frames, num_speakers)

When I print the labels, the model always outputs num_speakers equal to two, even though my recording has more than two speakers. The probabilities also seem quite random and change a lot from frame to frame. (I have estimated the frame duration for this model to be 0.02 s, assuming the windows do not overlap. Is this assumption correct?) Here is a sample of the output probabilities for my file:

        [3.1691e-01, 1.8716e-03],
        [1.2288e-01, 1.1750e-01],
        [1.8027e-01, 2.5845e-03],
        [4.1805e-02, 5.4180e-03],
        [5.3213e-02, 7.5354e-03],
        [8.5969e-02, 1.3703e-02],
        [5.9220e-02, 2.8392e-01],
        [5.0971e-02, 4.1676e-02],
        [2.2690e-01, 1.3807e-02],
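
To make the question concrete, here is a minimal sketch of how per-frame probabilities like these could be turned into speaker segments. It rests on two assumptions: the 0.02 s frame stride estimated above (not confirmed), and the 0.5 cut-off that the sample code appears to apply after the sigmoid. The helper `frames_to_segments` and its parameter names are hypothetical and not part of transformers.

```python
FRAME_STRIDE_S = 0.02  # assumption from this post: one frame every 20 ms, no overlap
THRESHOLD = 0.5        # assumption: same cut-off as in the sample code


def frames_to_segments(probabilities, frame_stride_s=FRAME_STRIDE_S, threshold=THRESHOLD):
    """Map a (num_frames, num_speakers) nested list to
    {speaker_index: [(start_s, end_s), ...]}.

    Each speaker column is thresholded independently, so a frame can have
    zero, one, or several active speakers.
    """
    num_frames = len(probabilities)
    num_speakers = len(probabilities[0]) if num_frames else 0
    segments = {spk: [] for spk in range(num_speakers)}
    for spk in range(num_speakers):
        start = None  # index of the first frame of the currently open segment
        for i, frame in enumerate(probabilities):
            active = frame[spk] > threshold
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments[spk].append((start * frame_stride_s, i * frame_stride_s))
                start = None
        if start is not None:  # close a segment that runs to the last frame
            segments[spk].append((start * frame_stride_s, num_frames * frame_stride_s))
    return segments


# The nine frames quoted above.
sample = [
    [3.1691e-01, 1.8716e-03],
    [1.2288e-01, 1.1750e-01],
    [1.8027e-01, 2.5845e-03],
    [4.1805e-02, 5.4180e-03],
    [5.3213e-02, 7.5354e-03],
    [8.5969e-02, 1.3703e-02],
    [5.9220e-02, 2.8392e-01],
    [5.0971e-02, 4.1676e-02],
    [2.2690e-01, 1.3807e-02],
]
print(frames_to_segments(sample))  # → {0: [], 1: []}
```

With a 0.5 threshold, none of the frames quoted above counts as active for either speaker, since the largest value is about 0.32.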

Any help with understanding the output would be much appreciated.

Thank you,

Jonas
