Detecting who is speaking?

I’m working on a business that transcribes audio at scale and does a bunch of analysis (traditional, ML, and LLM stuff).

A lot of the audio is interviews, and I have access to the transcripts and diarization info.

What are your thoughts on how to go about detecting who is speaking?

My initial thought is to play around with some local models: feed them a block of text from near the beginning of the transcript and ask whether it's clear who is speaking. Are there any tips/recs for that, and/or should I be looking to fine-tune something?
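To be concrete, here's a rough sketch of the kind of prompt I have in mind. The speaker labels and helper name are placeholders, not from any particular library; it just turns diarized segments into a "who is speaking?" question for a model:

```python
def build_speaker_id_prompt(segments, max_chars=2000):
    """Format diarized segments (speaker_label, text) into a prompt
    asking a model to infer who each generic label is.

    `segments` is a list of (label, utterance) tuples, truncated to
    roughly max_chars so only the opening of the interview is sent.
    """
    lines = []
    total = 0
    for label, text in segments:
        line = f"{label}: {text}"
        if total + len(line) > max_chars:
            break
        lines.append(line)
        total += len(line)
    transcript = "\n".join(lines)
    return (
        "Below is the start of an interview transcript with generic "
        "speaker labels. Based only on this text, identify who each "
        "speaker is (e.g. interviewer vs. guest, names if mentioned). "
        "If it is unclear, say so.\n\n" + transcript
    )

segments = [
    ("SPEAKER_00", "Thanks for joining us today, Dr. Lee."),
    ("SPEAKER_01", "Happy to be here."),
]
prompt = build_speaker_id_prompt(segments)
print(prompt)
```

The idea is that names often get mentioned in the intro, so sending just the opening block keeps token costs down.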

Also, out of curiosity: are there models/solutions available to detect the speaker based on the sound alone? For example, if I had twenty 30-second samples of someone speaking, could I cost-effectively detect whether a new voice matches? I'm thinking this would be useful for the most common speakers.
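From what I understand, the usual approach is speaker embeddings (e.g. an ECAPA-TDNN-style model, which I haven't benchmarked myself) plus a similarity threshold. Assuming the embeddings are already computed, the matching step itself is just cosine similarity against an enrolled set, something like:

```python
import numpy as np

def best_match(query_emb, enrolled, threshold=0.7):
    """Compare a query speaker embedding against enrolled speakers
    by cosine similarity; return (name, score), with name=None if
    nothing clears the threshold. The threshold value here is a
    made-up placeholder that would need tuning on real data.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cos(query_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Toy 2-D vectors standing in for real embeddings (which are
# typically a few hundred dimensions, averaged over samples).
enrolled = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
name, score = best_match(np.array([0.9, 0.1]), enrolled)
print(name, round(score, 3))
```

My thinking is that averaging embeddings over the 20 samples per person would give a more stable enrolled vector than any single clip.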

Thanks for the help!