Environment
- pyannote.audio version: 3.1.1
- torch version: 2.5.1+cu124
- Platform: [your OS]
- CUDA: Yes
- GPU: [your GPU model]
- Python version: [your version]
- torchaudio version: [your version]
Issue Description
When using pyannote/embedding for speaker verification, every speaker receives a near-perfect similarity score (1.000) against a single reference sample. This happens even between obviously different speakers in a professional audiobook recording (Dracula), where the voices are clearly distinct despite all narrators being British.
Reproduction Steps
- Load a 10-minute reference audio clip of the target speaker (FLAC format)
- Load the full audiobook (4 hours, FLAC format)
- Extract embeddings using pyannote/embedding
- Compare embeddings using cosine similarity
- Result: ALL speakers match with 1.000 similarity
Current Behavior
- Every speaker gets a similarity score between 0.999 and 1.000
- This happens consistently across different speakers
- Reference and speaker embeddings both have shape [1, 512]
- Even clearly different voices (male/female) get perfect matches
Code
```python
# Complete minimal example to reproduce the issue
import torch
import torch.nn.functional as F
import torchaudio
from pyannote.audio import Model

# Load reference audio and downmix to mono
reference_waveform, sample_rate = torchaudio.load("reference.flac")
reference_waveform = reference_waveform.mean(dim=0, keepdim=True)

# Setup model
device = torch.device("cuda")
embedding_model = Model.from_pretrained(
    "pyannote/embedding", use_auth_token="[REDACTED]"
).to(device)

# Get reference embedding (unsqueeze adds the batch dimension)
reference_features = embedding_model(reference_waveform.unsqueeze(0).to(device))
reference_features = F.normalize(reference_features, p=2, dim=1)

# Process test audio the same way
test_waveform, _ = torchaudio.load("test.flac")
test_waveform = test_waveform.mean(dim=0, keepdim=True)
speaker_embedding = embedding_model(test_waveform.unsqueeze(0).to(device))
speaker_embedding = F.normalize(speaker_embedding, p=2, dim=1)

# Calculate cosine similarity between the L2-normalized embeddings
similarity = F.cosine_similarity(reference_features, speaker_embedding, dim=1).mean()
print(f"Similarity: {similarity.item():.6f}")
```
Debug Information
Model Configuration
```python
print(embedding_model)
```
[Output of model architecture]
Tensor Shapes and Values
Reference waveform shape: [1, 31246073]
Reference embedding shape: [1, 512]
Test embedding shape: [1, 512]
Example similarity scores between different speakers:
Speaker A vs Reference: 1.000000
Speaker B vs Reference: 0.999998
Speaker C vs Reference: 1.000000
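To rule out a bug in the similarity computation itself, here is a quick sanity check (reusing the variables from the code above) for whether the model emits nearly identical vectors regardless of input:

```python
# If the embeddings collapse to the same vector, the element-wise
# difference is ~0 and cosine similarity is trivially ~1 for any input.
diff = (reference_features - speaker_embedding).abs().max()
print(f"Max abs element-wise difference: {diff.item():.6e}")
print(f"Reference embedding std across dims: {reference_features.std().item():.6e}")
```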
Questions
- Is this expected behavior with the current version?
- Could the version mismatch warnings printed when loading the checkpoint be causing this?
- Are there recommended settings to get realistic similarity scores?
- Should we be using a different approach for speaker verification? (see the sketch after this list)
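Regarding the last question: the extraction path we understand the model card to document goes through the Inference wrapper with window="whole" rather than calling the model directly. A sketch of what we would try next (file names are the same placeholders as above):

```python
from scipy.spatial.distance import cdist

from pyannote.audio import Inference

# window="whole" pools one embedding over the entire file.
inference = Inference(embedding_model, window="whole")
emb_ref = inference("reference.flac")   # numpy array
emb_test = inference("test.flac")

# The model card compares embeddings with cosine *distance* (0 = same direction).
distance = cdist(emb_ref.reshape(1, -1), emb_test.reshape(1, -1), metric="cosine")[0, 0]
print(f"Cosine distance: {distance:.6f}")
```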
Additional Notes
- Using a professionally produced audiobook with high-quality audio
- Multiple speakers are clearly different to human ears
- Tried with different audio segments and speakers
- Consistent 1.000 similarity across all tests