Environment
- pyannote.audio version: 3.1.1
- torch version: 2.5.1+cu124
- Platform: [your OS]
- CUDA: Yes
- GPU: [your GPU model]
- Python version: [your version]
- torchaudio version: [your version]
Issue Description
When using pyannote/embedding for speaker verification, every speaker receives a near-perfect similarity score (1.000) against a single reference sample. This happens even between obviously different speakers in a professional audiobook recording (Dracula), where the voices are clearly distinct despite all narrators being British.
Reproduction Steps
- Load a 10-minute reference audio clip of the target speaker (FLAC format)
- Load the full audiobook (4 hours, FLAC format)
- Extract embeddings using pyannote/embedding
- Compare embeddings using cosine similarity
- Result: ALL speakers match with 1.000 similarity
Current Behavior
- Every speaker gets a similarity score between 0.999 and 1.000
- This happens consistently across different speakers
- Reference and speaker embeddings both have shape [1, 512]
- Even clearly different voices (male/female) get perfect matches
Code
```python
# Complete minimal example to reproduce the issue
import torch
import torch.nn.functional as F
import torchaudio
from pyannote.audio import Model

# Load reference audio and downmix to mono
reference_waveform, sample_rate = torchaudio.load("reference.flac")
reference_waveform = reference_waveform.mean(dim=0, keepdim=True)

# Setup model
device = torch.device("cuda")
embedding_model = Model.from_pretrained(
    "pyannote/embedding", use_auth_token="[REDACTED]"
).to(device)

# Get reference embedding (unsqueeze adds the batch dimension)
reference_features = embedding_model(reference_waveform.unsqueeze(0).to(device))
reference_features = F.normalize(reference_features, p=2, dim=1)

# Process test audio the same way
test_waveform, _ = torchaudio.load("test.flac")
test_waveform = test_waveform.mean(dim=0, keepdim=True)
speaker_embedding = embedding_model(test_waveform.unsqueeze(0).to(device))
speaker_embedding = F.normalize(speaker_embedding, p=2, dim=1)

# Calculate cosine similarity between the L2-normalized embeddings
similarity = F.cosine_similarity(reference_features, speaker_embedding, dim=1).mean()
print(f"Similarity: {similarity.item():.6f}")
```
Debug Information
Model Configuration
```python
print(embedding_model)
```
[Output of model architecture]
Tensor Shapes and Values
Reference waveform shape: [1, 31246073]
Reference embedding shape: [1, 512]
Test embedding shape: [1, 512]
Example similarity scores between different speakers:
Speaker A vs Reference: 1.000000
Speaker B vs Reference: 0.999998
Speaker C vs Reference: 1.000000
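To rule out a bug in the similarity computation itself, here is a quick sanity check (reusing the variables from the code above) for whether the model emits nearly identical vectors regardless of input:

```python
# If the embeddings collapse to the same vector, the element-wise
# difference is ~0 and cosine similarity is trivially ~1 for any input.
diff = (reference_features - speaker_embedding).abs().max()
print(f"Max abs element-wise difference: {diff.item():.6e}")
print(f"Reference embedding std across dims: {reference_features.std().item():.6e}")
```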
Questions
- Is this expected behavior with the current version?
- Could the version mismatch warnings printed when loading the checkpoint be causing this?
- Are there recommended settings to get realistic similarity scores?
- Should we be using a different approach for speaker verification? (see the sketch after this list)
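Regarding the last question: the extraction path we understand the model card to document goes through the Inference wrapper with window="whole" rather than calling the model directly. A sketch of what we would try next (file names are the same placeholders as above):

```python
from scipy.spatial.distance import cdist

from pyannote.audio import Inference

# window="whole" pools one embedding over the entire file.
inference = Inference(embedding_model, window="whole")
emb_ref = inference("reference.flac")   # numpy array
emb_test = inference("test.flac")

# The model card compares embeddings with cosine *distance* (0 = same direction).
distance = cdist(emb_ref.reshape(1, -1), emb_test.reshape(1, -1), metric="cosine")[0, 0]
print(f"Cosine distance: {distance:.6f}")
```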
Additional Notes
- Using a professionally produced audiobook with high-quality audio
- Multiple speakers are clearly different to human ears
- Tried with different audio segments and speakers
- Consistent 1.000 similarity across all tests