Hi, I want to transcribe an audio dialogue between two speakers in Hebrew and assign speaker labels to the text segments. I'm trying to combine pyannote speaker diarization with Whisper. Can anyone share an example of this task (with models that support Hebrew)?
Hi there!
You can definitely combine Pyannote for speaker diarization and Whisper for transcription to process a Hebrew audio file. Here’s an example workflow:
Step 1: Install Required Libraries
Make sure you have the necessary libraries installed. You’ll need:
- pyannote.audio for speaker diarization
- openai-whisper (or Hugging Face's Whisper integration) for transcription
Install them with:
pip install pyannote.audio openai-whisper
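Note: pyannote's pretrained pipelines are gated on the Hugging Face Hub, so you'll need to accept the model's user conditions on its model page and authenticate with an access token first, e.g.:
huggingface-cli login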
Step 2: Perform Speaker Diarization
You can use Pyannote’s pre-trained speaker diarization models to segment the audio by speaker. Here’s an example:
from pyannote.audio import Pipeline
# Load the pre-trained speaker diarization pipeline
# (requires a Hugging Face access token; see the note above)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HF_TOKEN",
)
# Path to your Hebrew audio file
audio_path = "your_audio_file.wav"
# Perform diarization
diarization = pipeline(audio_path)
# Print diarization results
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f} - {segment.end:.2f}: {speaker}")
This will output time-stamped segments with speaker labels, such as:
0.00 - 5.12: SPEAKER_00
5.12 - 10.34: SPEAKER_01
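Since you know the dialogue has exactly two speakers, you can pass that as a hint, which usually improves the segmentation (num_speakers is an argument of pyannote's diarization pipeline):
diarization = pipeline(audio_path, num_speakers=2)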
Step 3: Transcribe the Audio Using Whisper
Whisper supports Hebrew transcription (use a multilingual model such as medium or large). Here's an example:
import whisper
# Load the Whisper model
model = whisper.load_model("large")
# Transcribe the audio
result = model.transcribe(audio_path, language="he")
# Print the transcription
print(result["text"])
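Whisper also returns time-stamped segments in result["segments"], which will be useful when aligning the transcription with the diarization output:
# Each segment has start/end times (in seconds) and the decoded text
for seg in result["segments"]:
    print(f"{seg['start']:.2f} - {seg['end']:.2f}: {seg['text']}")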
Step 4: Combine Diarization and Transcription
You can merge the speaker diarization results with Whisper’s transcription by splitting the transcription into segments based on the diarization output. Here’s a full example:
from pyannote.audio import Pipeline
import whisper
import torchaudio
# Load the diarization pipeline and Whisper model
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HF_TOKEN",  # see the authentication note above
)
whisper_model = whisper.load_model("large")
# File path to your audio file
audio_path = "your_audio_file.wav"
# Step 1: Perform speaker diarization
diarization = diarization_pipeline(audio_path)
# Step 2: Load audio for segmentation
waveform, sample_rate = torchaudio.load(audio_path)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
# Step 3: Iterate through diarization segments
segments_text = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
    # Extract the audio for this speaker turn (16 kHz after resampling)
    start, end = segment.start, segment.end
    segment_audio = waveform[:, int(start * 16000): int(end * 16000)]
    # Save the segment audio temporarily for Whisper
    torchaudio.save("temp_segment.wav", segment_audio, 16000)
    # Transcribe the segment in Hebrew
    transcription = whisper_model.transcribe("temp_segment.wav", language="he")["text"]
    # Append the labeled result
    segments_text.append(f"{speaker}: {transcription}")
# Step 4: Print combined results
for text in segments_text:
    print(text)
Example Output
SPEAKER_00: שלום, איך אתה?
SPEAKER_01: אני בסדר, תודה.
SPEAKER_00: טוב לשמוע.
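pyannote names speakers generically (SPEAKER_00, SPEAKER_01, ...). If you want friendlier labels, a simple remapping works (a small sketch; the label_map values are just illustrative):
# Map pyannote's generic labels to display names
label_map = {"SPEAKER_00": "Speaker 1", "SPEAKER_01": "Speaker 2"}
for text in segments_text:
    label, _, rest = text.partition(": ")
    print(f"{label_map.get(label, label)}: {rest}")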
Notes
- Language Support: Use Whisper's large or medium model for the best Hebrew transcription quality.
- Speaker Overlap: Pyannote handles overlapping speakers, but merging overlapping segments with the transcription may require more advanced processing (see the sketch after these notes).
- Audio Quality: Ensure your audio has good quality for better diarization and transcription accuracy.
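For the overlap point above, one common approach (a minimal sketch, reusing diarization, whisper_model, and audio_path from the full example) is to transcribe the whole file once and assign each Whisper segment to the speaker whose diarization turn overlaps it most:
# Transcribe the full file once to keep context across turns
result = whisper_model.transcribe(audio_path, language="he")
turns = list(diarization.itertracks(yield_label=True))
for seg in result["segments"]:
    # Pick the diarization turn with the largest temporal overlap
    best_speaker, best_overlap = None, 0.0
    for turn, _, speaker in turns:
        overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    print(f"{best_speaker or 'UNKNOWN'}: {seg['text'].strip()}")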
Let me know if you need further clarification!