Combining pyannote with whisper to get a given speaker's text in Hebrew

Hi, I want to transcribe an audio dialogue between two speakers in Hebrew and assign speaker labels to the text segments. I’m trying to combine pyannote speaker diarization with Whisper. Can anyone share an example of this task (with models that support Hebrew)?


Hi there!

You can definitely combine Pyannote for speaker diarization and Whisper for transcription to process a Hebrew audio file. Here’s an example workflow:


Step 1: Install Required Libraries

Make sure you have the necessary libraries installed. You’ll need:

  • pyannote.audio for speaker diarization.
  • openai-whisper (or Hugging Face’s Whisper integration) for transcription.

Install them with:

pip install pyannote.audio openai-whisper
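
Note that openai-whisper decodes audio through ffmpeg, so make sure ffmpeg is installed on your system as well. A quick sanity check that everything imports (a minimal sketch, nothing more):

# Both imports should succeed without errors if the installs worked
import pyannote.audio
import whisper

print("pyannote.audio and whisper are ready")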

Step 2: Perform Speaker Diarization

You can use Pyannote’s pre-trained speaker diarization models to segment the audio by speaker. Here’s an example:

from pyannote.audio import Pipeline

# Load the pre-trained speaker diarization pipeline
# (the model is gated on the Hub: accept its user conditions and pass your Hugging Face access token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HF_TOKEN",  # replace with your own token
)

# Path to your Hebrew audio file
audio_path = "your_audio_file.wav"

# Perform diarization
diarization = pipeline(audio_path)

# Print diarization results
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f} - {segment.end:.2f}: {speaker}")

This will output time-stamped segments with speaker labels (pyannote names them SPEAKER_00, SPEAKER_01, and so on), such as:

0.00 - 5.12: SPEAKER_00
5.12 - 10.34: SPEAKER_01
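
Since you know your dialogue has exactly two speakers, you can pass that to the pipeline, which usually makes the clustering more stable:

# Telling the pipeline the number of speakers in advance helps it cluster correctly
diarization = pipeline(audio_path, num_speakers=2)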

Step 3: Transcribe the Audio Using Whisper

Whisper’s multilingual models (e.g. medium or large) support Hebrew transcription. Here’s an example:

import whisper

# Load the Whisper model
model = whisper.load_model("large")

# Transcribe the audio
result = model.transcribe(audio_path, language="he")

# Print the transcription
print(result["text"])
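
Whisper also returns time-stamped segments under result["segments"], which is what you will align with the diarization:

# Each segment carries start/end timestamps plus its text
for seg in result["segments"]:
    print(f"{seg['start']:.2f} - {seg['end']:.2f}: {seg['text']}")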

Step 4: Combine Diarization and Transcription

You can merge the two by cutting the audio into the diarized speaker turns and transcribing each turn separately with Whisper. Here’s a full example:

from pyannote.audio import Pipeline
import whisper
import torchaudio

# Load the diarization pipeline (gated: needs your Hugging Face token) and the Whisper model
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HF_TOKEN",  # replace with your own token
)
whisper_model = whisper.load_model("large")

# File path to your audio file
audio_path = "your_audio_file.wav"

# Step 1: Perform speaker diarization
diarization = diarization_pipeline(audio_path)

# Step 2: Load the audio and resample to 16 kHz (the rate used for slicing below)
waveform, sample_rate = torchaudio.load(audio_path)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    sample_rate = 16000

# Step 3: Iterate through diarization segments
segments_text = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
    # Slice this speaker turn out of the waveform
    start, end = segment.start, segment.end
    segment_audio = waveform[:, int(start * sample_rate): int(end * sample_rate)]

    # Save the segment to a temporary file for Whisper
    torchaudio.save("temp_segment.wav", segment_audio, sample_rate)

    # Transcribe the segment
    transcription = whisper_model.transcribe("temp_segment.wav", language="he")["text"]

    # Append results
    segments_text.append(f"{speaker}: {transcription}")

# Step 4: Print combined results
for text in segments_text:
    print(text)

Example Output

SPEAKER_00: שלום, איך אתה? (Hello, how are you?)
SPEAKER_01: אני בסדר, תודה. (I’m fine, thanks.)
SPEAKER_00: טוב לשמוע. (Good to hear.)
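
An alternative to transcribing each turn separately (which can lose context at turn boundaries) is to transcribe the whole file once and assign each Whisper segment to whichever speaker overlaps it most in time. A sketch of that idea, reusing the diarization and whisper_model objects from above (assign_speaker is my own helper, not part of either library):

def assign_speaker(diarization, start, end):
    # Pick the speaker whose diarized turns overlap this time span the most
    best_speaker, best_overlap = "unknown", 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best_overlap, best_speaker = overlap, speaker
    return best_speaker

# Transcribe the full file once, then label each segment by overlap
result = whisper_model.transcribe(audio_path, language="he")
for seg in result["segments"]:
    speaker = assign_speaker(diarization, seg["start"], seg["end"])
    print(f"{speaker}: {seg['text']}")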

Notes

  1. Language Support: Use Whisper’s multilingual medium or large model for the best Hebrew accuracy (the .en variants are English-only).
  2. Speaker Overlap: Pyannote can detect overlapping speech, but merging overlapping turns with the transcription may require more advanced processing.
  3. Audio Quality: Clean, 16 kHz mono audio improves both diarization and transcription accuracy; see the preprocessing sketch after this list.
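
For the audio quality point, converting the recording to 16 kHz mono up front is a cheap way to help both models. A minimal preprocessing sketch with torchaudio (file names are placeholders):

import torchaudio

# Load the original recording
waveform, sample_rate = torchaudio.load("your_audio_file.wav")

# Downmix to mono by averaging the channels
mono = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    mono = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(mono)

torchaudio.save("your_audio_file_16k_mono.wav", mono, 16000)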

Let me know if you need further clarification!
