I read the paper “X-Vectors: Robust DNN Embeddings for Speaker Recognition”, which describes how the PyAnnote embedding block works.
I’m not sure I understand how the X-Vector model was trained and tested:
- According to the paper, there is a DNN that was trained with N speakers (as classification task).
- Here’s the model:
- For embedding vectors, they exclude the last 2 layers of the DNN.
- They used LDA to reduce the embedding dimension from 512 to 150 and then ran a PLDA model.
- What is the “total context” in the model?
- If they trained the DNN on an N-speaker classification task, why do they need to run LDA + PLDA?
- Are there any learned parameters in the LDA + PLDA step?
- In order to produce an embedding vector, do we also need to run the LDA + PLDA step?
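
To make the “total context” question concrete, here is a minimal sketch of my current guess: that it is the number of input frames a single output of a layer can see once the frame-level layers are stacked. The per-layer offsets are my own reading of the paper's architecture table, and `LAYER_OFFSETS` and `total_context` are names I made up for illustration, so please correct me if this interpretation is wrong.

```python
# Per-layer frame offsets relative to the current frame t.
# Assumption: this is my reading of the paper's architecture table,
# not something checked against an actual implementation.
LAYER_OFFSETS = [
    ("frame1", [-2, -1, 0, 1, 2]),
    ("frame2", [-2, 0, 2]),
    ("frame3", [-3, 0, 3]),
    ("frame4", [0]),
    ("frame5", [0]),
]


def total_context(layers):
    """Accumulate how many input frames each layer's output depends on."""
    left, right = 0, 0  # frames reached to the left / right of t so far
    result = []
    for name, offsets in layers:
        left += -min(offsets)   # furthest look-back added by this layer
        right += max(offsets)   # furthest look-ahead added by this layer
        result.append((name, left + right + 1))  # +1 for frame t itself
    return result


for name, frames in total_context(LAYER_OFFSETS):
    print(name, frames)
```

If this guess is right, the last frame-level layer would see 15 input frames, before the statistics pooling layer aggregates over the whole segment.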