Speaker diarization with Whisper?

Any suggestions for speaker diarization with Whisper? Is pyannote the way to go, or are there other alternatives?

Speechbox supports Whisper + pyannote speaker diarization: see the huggingface/speechbox repository on GitHub.

In my experience, the pre-trained pyannote models work very well, but there’s the option of fine-tuning these models too.

You can drop any fine-tuned Whisper or pyannote model directly into the Speechbox pipeline :hugs:
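
To give a rough idea of what combining the two models involves, here is a minimal sketch in plain Python. It is not Speechbox's actual code: it assumes you already have timestamped text chunks from Whisper and speaker segments from pyannote, and it simply gives each chunk the speaker whose segment overlaps it the most. The function names, the tuple layouts, and the sample data are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical data layouts, chosen for this sketch only:
Segment = Tuple[float, float, str]  # (start_s, end_s, speaker label) from a diarizer
Chunk = Tuple[float, float, str]    # (start_s, end_s, text) from an ASR model


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks: List[Chunk], segments: List[Segment]) -> List[Dict]:
    """Label each text chunk with the speaker whose segment overlaps it most."""
    labelled = []
    for start, end, text in chunks:
        best_speaker = "UNKNOWN"  # used when no speaker segment overlaps the chunk
        best_overlap = 0.0
        for seg_start, seg_end, speaker in segments:
            shared = overlap(start, end, seg_start, seg_end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = speaker
        labelled.append(
            {"speaker": best_speaker, "start": start, "end": end, "text": text}
        )
    return labelled


# Made-up example data (not real model output).
segments = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.0, "SPEAKER_01")]
chunks = [
    (0.0, 3.8, "Hello, how are you?"),
    (3.9, 8.5, "Fine, thanks."),
    (10.0, 11.0, "Bye."),
]

result = assign_speakers(chunks, segments)
for row in result:
    print(f'{row["speaker"]} [{row["start"]:.1f}-{row["end"]:.1f}]: {row["text"]}')
```

Maximum overlap is only one possible matching rule; a real pipeline may align on segment boundaries instead, so treat this as a starting point rather than a reference implementation.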

