Speechbox supports Whisper + pyannote speaker diarization (see the huggingface/speechbox repository on GitHub).
In my experience, the pre-trained pyannote models work very well, but there’s the option of fine-tuning these models too.
We can drop any fine-tuned Whisper or pyannote model directly into the Speechbox pipeline.
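
As a rough sketch of what that looks like, assuming the `ASRDiarizationPipeline.from_pretrained` entry point from Speechbox with its `diarizer_model` keyword: the model IDs below are just the stock checkpoints used as defaults, and `build_pipeline` / `format_transcript` are hypothetical helper names of mine, not part of the library. The formatter assumes each returned segment is a dict with `speaker`, `text`, and `timestamp` (a start/end pair in seconds).

```python
from typing import Any, Dict, List


def build_pipeline(
    asr_model: str = "openai/whisper-tiny",
    diarizer_model: str = "pyannote/speaker-diarization",
):
    """Build a Whisper + pyannote pipeline; pass fine-tuned checkpoint IDs to swap models."""
    # Imported lazily so format_transcript can be used without Speechbox installed.
    from speechbox import ASRDiarizationPipeline

    # Assumption: the pyannote checkpoint is gated on the Hub, so you need to
    # have accepted its terms and be logged in (e.g. via `huggingface-cli login`).
    return ASRDiarizationPipeline.from_pretrained(
        asr_model, diarizer_model=diarizer_model
    )


def format_transcript(segments: List[Dict[str, Any]]) -> str:
    """Render diarized segments as one '[start - end] SPEAKER: text' line each."""
    lines = []
    for seg in segments:
        # Assumes both timestamps are present (not None).
        start, end = seg["timestamp"]
        lines.append(
            f"[{start:.2f}s - {end:.2f}s] {seg['speaker']}: {seg['text'].strip()}"
        )
    return "\n".join(lines)


# Usage (downloads models, so left commented out):
# pipeline = build_pipeline()
# segments = pipeline("meeting.wav")  # "meeting.wav" is a placeholder path
# print(format_transcript(segments))
```

To use your own models, pass their Hub IDs or local paths as `asr_model` and `diarizer_model`. This assumes the fine-tuned pyannote checkpoint is packaged as a full diarization pipeline rather than a standalone segmentation model.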