Detecting who is speaking?

I’m working on a business that transcribes audio at scale and does a bunch of analysis (traditional, ML, and LLM stuff).

A lot of the audio is interviews, and I have access to the transcripts and diarization info.

What are your thoughts on how to go about detecting who is speaking?

My initial thought is to play around with some local models: feed them a block of text from near the beginning of the transcript and ask whether it's clear who is speaking. Are there any tips/recs for that, and/or should I be looking to fine-tune something?
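To be concrete, here's a rough sketch of the kind of prompt I have in mind. The speaker labels and helper name are placeholders, not from any particular library; it just turns diarized segments into a "who is speaking?" question for a model:

```python
def build_speaker_id_prompt(segments, max_chars=2000):
    """Format diarized segments (speaker_label, text) into a prompt
    asking a model to infer who each generic label is.

    `segments` is a list of (label, utterance) tuples, truncated to
    roughly max_chars so only the opening of the interview is sent.
    """
    lines = []
    total = 0
    for label, text in segments:
        line = f"{label}: {text}"
        if total + len(line) > max_chars:
            break
        lines.append(line)
        total += len(line)
    transcript = "\n".join(lines)
    return (
        "Below is the start of an interview transcript with generic "
        "speaker labels. Based only on this text, identify who each "
        "speaker is (e.g. interviewer vs. guest, names if mentioned). "
        "If it is unclear, say so.\n\n" + transcript
    )

segments = [
    ("SPEAKER_00", "Thanks for joining us today, Dr. Lee."),
    ("SPEAKER_01", "Happy to be here."),
]
prompt = build_speaker_id_prompt(segments)
print(prompt)
```

The idea is that names often get mentioned in the intro, so sending just the opening block keeps token costs down.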

Also, out of curiosity: are there models/solutions available to detect the speaker based on the sound alone? For example, if I had twenty 30-second samples of someone speaking, could I cost-effectively detect whether a new voice matches? I'm thinking this would be useful for the most common speakers.
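From what I understand, the usual approach is speaker embeddings (e.g. an ECAPA-TDNN-style model, which I haven't benchmarked myself) plus a similarity threshold. Assuming the embeddings are already computed, the matching step itself is just cosine similarity against an enrolled set, something like:

```python
import numpy as np

def best_match(query_emb, enrolled, threshold=0.7):
    """Compare a query speaker embedding against enrolled speakers
    by cosine similarity; return (name, score), with name=None if
    nothing clears the threshold. The threshold value here is a
    made-up placeholder that would need tuning on real data.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cos(query_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Toy 2-D vectors standing in for real embeddings (which are
# typically a few hundred dimensions, averaged over samples).
enrolled = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
name, score = best_match(np.array([0.9, 0.1]), enrolled)
print(name, round(score, 3))
```

My thinking is that averaging embeddings over the 20 samples per person would give a more stable enrolled vector than any single clip.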

Thanks for the help!