I read that pyannote uses this embedding model:
X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION
- According to the paper, the model was trained on audio files with a sample rate of 8 kHz.
- The model is a classification model whose final layer has N outputs (N = number of different speakers).
- The input was up to 3 seconds long, with a frame length of 25 ms.
- What is the value of N? (I couldn't find it in the paper.)
- Given the 25 ms frame length, is it correct that no assumption is made about the total speech length? That is, can we get an embedding vector for speech of any length?
- If the model was trained on speech with a sample rate of 8 kHz, do I need to resample audio to 8 kHz before extracting its embedding vector?
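For context on the resampling question: if the embedding model was indeed trained on 8 kHz audio, a common approach is to resample any input to that rate before feeding it to the model. A minimal sketch using `scipy.signal.resample_poly` (this is my own illustrative helper, not part of pyannote's API; the `to_8k` name and the assumption that 8 kHz is the required rate are mine):

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_8k(waveform, orig_sr):
    """Resample a 1-D waveform to 8 kHz (assumed model training rate)."""
    target_sr = 8000
    if orig_sr == target_sr:
        return waveform
    g = gcd(orig_sr, target_sr)
    # Polyphase resampling with the reduced up/down ratio
    return resample_poly(waveform, target_sr // g, orig_sr // g)

# Example: 1 second of a 440 Hz tone recorded at 16 kHz
sr = 16000
t = np.arange(sr)
x = np.sin(2 * np.pi * 440 * t / sr)

y = to_8k(x, sr)
print(len(y))  # 8000 samples: still 1 second, now at 8 kHz
```

The resampled waveform (rather than the raw 16 kHz or 44.1 kHz audio) would then be passed to the feature-extraction / embedding step.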