Speaker Verification: All Speakers Getting Perfect 1.000 Similarity Scores

Environment

  • pyannote.audio version: 3.1.1
  • torch version: 2.5.1+cu124
  • Platform: [your OS]
  • CUDA: Yes
  • GPU: [your GPU model]
  • Python version: [your version]
  • torchaudio version: [your version]

Issue Description

When using pyannote/embedding for speaker verification, every speaker receives a near-perfect similarity score (1.000) when compared to a reference sample. This happens even between obviously different speakers in a professional audiobook (Dracula), where the voices are clearly distinct despite all narrators being British.

Reproduction Steps

  1. Load a 10-minute reference audio of the target speaker (FLAC format)
  2. Load full audiobook (4 hours, FLAC format)
  3. Extract embeddings using pyannote/embedding
  4. Compare embeddings using cosine similarity
  5. Result: ALL speakers match with 1.000 similarity

Current Behavior

  • Every speaker gets similarity scores of 0.999+ to 1.000
  • This happens consistently across different speakers
  • Reference and speaker embeddings both have shape [1, 512]
  • Even clearly different voices (male/female) get perfect matches

Code

```python
# Complete minimal example to reproduce the issue
import torch
import torchaudio
import torch.nn.functional as F
from pyannote.audio import Model

# Load reference audio and downmix to mono
reference_waveform, sample_rate = torchaudio.load("reference.flac")
reference_waveform = reference_waveform.mean(dim=0, keepdim=True)

# Set up model
device = torch.device("cuda")
embedding_model = Model.from_pretrained(
    "pyannote/embedding", use_auth_token="[REDACTED]"
).to(device)

# Get reference embedding
reference_features = embedding_model(reference_waveform.unsqueeze(0).to(device))
reference_features = F.normalize(reference_features, p=2, dim=1)

# Process test audio
test_waveform, _ = torchaudio.load("test.flac")
test_waveform = test_waveform.mean(dim=0, keepdim=True)
speaker_embedding = embedding_model(test_waveform.unsqueeze(0).to(device))
speaker_embedding = F.normalize(speaker_embedding, p=2, dim=1)

# Calculate similarity
similarity = F.cosine_similarity(reference_features, speaker_embedding, dim=1).mean()
print(f"Similarity: {similarity.item():.6f}")
```

Debug Information

Model Configuration

```python
print(embedding_model)
```

[Output of model architecture]

Tensor Shapes and Values

```
Reference waveform shape: [1, 31246073]
Reference embedding shape: [1, 512]
Test embedding shape: [1, 512]
```

Example similarity scores between different speakers:

```
Speaker A vs Reference: 1.000000
Speaker B vs Reference: 0.999998
Speaker C vs Reference: 1.000000
```
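
For reference, a sketch of how a full pairwise similarity table could be produced across several speakers (the `embeddings` list and speaker names here are hypothetical placeholders, not part of the run above):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(embeddings):
    """Pairwise cosine similarity for a list of [1, 512] embedding tensors."""
    # Stack into a single [N, 512] matrix and L2-normalize each row.
    stacked = F.normalize(torch.cat(embeddings, dim=0), p=2, dim=1)
    # For unit-norm rows, the Gram matrix is exactly the cosine similarity matrix.
    return stacked @ stacked.T

# Usage (hypothetical): sims[i, j] should be well below 1.0 for different speakers.
# sims = pairwise_cosine([reference_features, emb_speaker_a, emb_speaker_b])
```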

Questions

  1. Is this expected behavior with the current version?
  2. Could the version-mismatch warnings shown when loading the model be causing this?
  3. Are there recommended settings for obtaining realistic similarity scores?
  4. Should we be using a different approach for speaker verification? (See the sketch after this list.)
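
Regarding question 4: the pyannote/embedding model card demonstrates extracting one embedding per file with the `Inference` wrapper and `window="whole"`, then comparing with `scipy.spatial.distance.cdist`. A minimal sketch along those lines (file paths and the token are placeholders):

```python
import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Model, Inference

model = Model.from_pretrained("pyannote/embedding", use_auth_token="[REDACTED]")
inference = Inference(model, window="whole")  # one embedding per whole file

emb_ref = np.atleast_2d(inference("reference.flac"))
emb_test = np.atleast_2d(inference("test.flac"))

# cdist returns cosine *distance*; similarity = 1 - distance.
distance = cdist(emb_ref, emb_test, metric="cosine")[0, 0]
print(f"Cosine similarity: {1.0 - distance:.6f}")
```

Note that `window="whole"` processes an entire file at once, so a 4-hour audiobook would likely need to be segmented first.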

Additional Notes

  • Using a professionally produced audiobook with high-quality audio
  • The speakers are clearly distinguishable by ear
  • Tried different audio segments and speakers
  • Similarity is consistently 1.000 across all tests