Reduced WavLMForXVector performance on LibriSpeech


I’ve been benchmarking WavLMForXVector on LibriSpeech data and I get EER = 4.7%, while the WavLM paper (Table II) reports EER = 0.84% for WavLM Base+.
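In case it helps to rule out a metric bug: here is a minimal sketch of how I understand EER to be computed, as the operating point where the miss rate equals the false-alarm rate over a threshold sweep (the function name and the toy scores are my own, not from the paper or the docs):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: point where miss rate ~= false-alarm rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    # Sweep the decision threshold by sorting scores in descending order
    order = np.argsort(-scores)
    labels = labels[order]
    # After accepting the top i scores:
    fnr = 1.0 - np.cumsum(labels) / labels.sum()          # miss rate
    fpr = np.cumsum(1.0 - labels) / (1.0 - labels).sum()  # false-alarm rate
    # EER is where the two curves cross
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])

# Toy sanity check: perfectly separated scores should give EER = 0
print(compute_eer(np.array([0.9, 0.85, 0.8]), np.array([0.2, 0.1])))  # 0.0
```

I get the same numbers from this as from sweeping thresholds by hand, so I don’t think the metric itself is my problem.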

I used the example code from the docs (WavLM), but loaded the data from a hard drive with the soundfile library instead. I also noticed that the example code seems to be missing the adaptive s-norm score normalization used in the paper, but I wonder whether that alone could account for such a large gap.
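For reference, this is roughly what I understand adaptive s-norm (AS-norm) to do: standardize each raw cosine score against the top-k most similar cohort scores on both the enrollment and test sides, then average the two normalized scores. The function name, cohort shape, and top_k value below are my own assumptions, not the paper's exact recipe:

```python
import numpy as np

def adaptive_s_norm(raw_score, enroll_emb, test_emb, cohort, top_k=100):
    """Adaptive symmetric score normalization (AS-norm) sketch.

    raw_score : cosine similarity between enrollment and test embeddings.
    cohort    : (N, D) array of unit-normalized cohort speaker embeddings
                (an assumed layout for this sketch).
    """
    top_k = min(top_k, len(cohort))
    # Cosine scores of each trial side against the cohort
    enroll_scores = cohort @ enroll_emb
    test_scores = cohort @ test_emb
    # Keep only the top-k closest cohort scores per side (the "adaptive" part)
    enroll_top = np.sort(enroll_scores)[-top_k:]
    test_top = np.sort(test_scores)[-top_k:]
    # Standardize the raw score against each side's cohort statistics
    z_enroll = (raw_score - enroll_top.mean()) / enroll_top.std()
    z_test = (raw_score - test_top.mean()) / test_top.std()
    return 0.5 * (z_enroll + z_test)

# Toy usage with random unit embeddings (placeholders, not real x-vectors)
rng = np.random.default_rng(0)
e = rng.normal(size=16); e /= np.linalg.norm(e)
t = rng.normal(size=16); t /= np.linalg.norm(t)
cohort = rng.normal(size=(50, 16))
cohort /= np.linalg.norm(cohort, axis=1, keepdims=True)
print(adaptive_s_norm(0.9, e, t, cohort, top_k=20))
```

My understanding is that s-norm mostly shifts the operating threshold and tightens score calibration, so I’d be surprised if omitting it alone explained a 4.7% vs 0.84% gap.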

Any ideas what I’m getting wrong?

Edit: there is a mistake in my original question. I used the VoxCeleb dataset for testing, not LibriSpeech.