I'm using pyannote for speaker diarization, based on the following segmentation model:
End-to-end speaker segmentation for overlap-aware resegmentation
In that paper, under Implementation details, the authors write:
- model input: sequences of 80,000 samples
[i.e. 5 s audio chunks at a 16 kHz sampling rate]
- model output: K_max-dimensional speaker activations between 0 and 1, every 16 ms.
- Does this mean the output shape is (K_max, 5000/16), i.e. roughly (K_max, 312)?
- The output values are between 0 and 1 — how should they be interpreted?
- How do we decide whether a new segment starts, how many segments each output contains, and how many speakers are active in it? (An example would be very helpful.)
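To make the last question concrete, here is how I currently imagine the post-processing would work — binarize each speaker's activation with a threshold and merge consecutive active frames into segments. The 0.5 threshold, the frame count of 312 (5000 ms / 16 ms), and K_max = 3 are my own assumptions for illustration, not values taken from the paper:

```python
import numpy as np

# Dummy activations with shape (num_frames, K_max), values in [0, 1].
# Assumed: 5 s / 16 ms ~ 312 frames, K_max = 3 speakers.
rng = np.random.default_rng(0)
activations = rng.random((312, 3))

THRESHOLD = 0.5          # assumed binarization threshold
FRAME_DURATION = 0.016   # seconds per output frame (16 ms)

def activations_to_segments(act, threshold=THRESHOLD):
    """Binarize per-speaker activations and merge runs of consecutive
    active frames into (start_frame, end_frame) segments."""
    segments = {}
    active = act >= threshold  # boolean array, shape (frames, speakers)
    for spk in range(active.shape[1]):
        spk_segments = []
        start = None
        for frame, is_active in enumerate(active[:, spk]):
            if is_active and start is None:
                start = frame                      # segment opens
            elif not is_active and start is not None:
                spk_segments.append((start, frame))  # segment closes
                start = None
        if start is not None:                      # still active at chunk end
            spk_segments.append((start, active.shape[0]))
        segments[spk] = spk_segments
    return segments

segments = activations_to_segments(activations)
for spk, segs in segments.items():
    # Convert frame indices to seconds for readability.
    times = [(round(s * FRAME_DURATION, 3), round(e * FRAME_DURATION, 3))
             for s, e in segs]
    print(f"speaker {spk}: {len(segs)} segments, first few: {times[:3]}")
```

Is this the intended way to read the output, or does pyannote apply a different binarization (e.g. with hysteresis or minimum-duration rules)?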