What are the parameters the pyannote embedding model was trained on?

I read that pyannote uses this embedding model:
X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION

  • According to the paper, they trained on audio files with a sample rate of 8 kHz.
  • The model is a classification model whose final layer has N outputs (N = number of different speakers).
  • The input was up to 3 seconds long, with a frame length of 25 ms.
  1. What is the value of N? (I didn't find it in the paper.)
  2. Given the frame length (i.e. 25 ms), is there no need for any assumption about the speech length? Am I right that we can get an embedding vector for speech of different lengths?
  3. If the model was trained on speech with a sample rate of 8 kHz, do I need to resample any speech to 8 kHz before getting its embedding vector?
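
To make question 2 concrete, here is a minimal sketch of how I understand the framing: a clip of any length is cut into 25 ms frames, so only the number of frames changes with duration. The 10 ms hop and the helper name `num_frames` are my own assumptions for illustration; they are not taken from the paper or from pyannote.

```python
def num_frames(duration_s, sample_rate=8000, frame_ms=25.0, hop_ms=10.0):
    """Count the full analysis frames in a clip, with no padding.

    frame_ms=25.0 follows the frame length mentioned above;
    hop_ms=10.0 is an assumed value, not confirmed by the source.
    """
    n_samples = int(round(duration_s * sample_rate))
    frame_len = int(round(sample_rate * frame_ms / 1000))  # samples per frame
    hop_len = int(round(sample_rate * hop_ms / 1000))      # samples between frame starts
    if n_samples < frame_len:
        return 0  # clip is shorter than a single frame
    return 1 + (n_samples - frame_len) // hop_len


# Clips of different lengths simply produce different frame counts.
for duration in (1.0, 2.0, 3.0):
    print(duration, num_frames(duration))
```

If this picture is right, the frame length by itself puts no upper limit on the speech length, and my question is whether the model adds any other constraint.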