How is the embedding model (x-vectors) trained?

I read the paper "X-Vectors: Robust DNN Embeddings for Speaker Recognition", which describes how the PyAnnote embedding block works.

I’m not sure I understand how the X-Vector model was trained and tested:

  • According to the paper, a DNN was trained on N speakers (as a classification task).
  • Here’s the model:


  • For embedding vectors, they exclude the last 2 layers of the DNN.
  • They used LDA to reduce the dimensionality of the embeddings from 512 to 150 and then ran a PLDA model on the result.
  1. What is the “total context” in the model?
  2. If they trained the DNN on an N-speaker classification task, why do they need to run LDA + PLDA?
  3. Are there any learned parameters in the LDA + PLDA step?
  4. To produce an embedding vector, do we also need to run the LDA + PLDA step?
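
To make question 1 concrete: my current guess is that “total context” means the cumulative number of input frames that a single output frame of a layer can see once the earlier layers are taken into account. Below is a minimal sketch of that reading. The per-layer offsets are my transcription of the architecture table in the paper, and `LAYER_OFFSETS` and `total_context` are names I made up for illustration, so please correct me if this interpretation is wrong.

```python
# Assumed per-layer frame offsets relative to the current frame t,
# as I read them from the architecture table in the x-vector paper.
LAYER_OFFSETS = [
    ("frame1", [-2, -1, 0, 1, 2]),
    ("frame2", [-2, 0, 2]),
    ("frame3", [-3, 0, 3]),
    ("frame4", [0]),
    ("frame5", [0]),
]


def total_context(layers):
    """Return (layer name, number of input frames one output frame can see)."""
    left, right = 0, 0  # reach of the receptive field into the past / future
    result = []
    for name, offsets in layers:
        left += -min(offsets)   # each layer extends the reach to the left
        right += max(offsets)   # ... and to the right
        result.append((name, left + right + 1))
    return result


contexts = total_context(LAYER_OFFSETS)
for name, width in contexts:
    print(f"{name}: {width} frames")
```

If this reading is right, the widths grow from 5 frames at the first layer to 15 frames at the last frame-level layer, and the pooling layer after that would then aggregate over the whole segment.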