RAG Model performance does not match paper

I am trying to reproduce the Question Answering results reported in the Retrieval-Augmented Generation (RAG) paper on the Natural Questions (NQ) dataset (44% Exact Match accuracy).
However, my scores plateau at around 40% EM, despite trying different document indexes and model checkpoints.
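For reference, here is how I am scoring Exact Match, using the standard SQuAD-style answer normalization (lowercase, strip punctuation and articles, collapse whitespace); the function names are my own, but the normalization follows the usual evaluation scripts:

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop articles (a/an/the), collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    """A prediction is correct if it matches any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in gold_answers)

# Illustrative scoring over a small batch of predictions
preds = ["The Eiffel Tower", "paris"]
golds = [["Eiffel Tower"], ["London"]]
em = 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
```

So unless the paper used a different normalization, I don't think the gap comes from the metric itself.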

Could someone let me know which DPR checkpoint, DPR training data, and RAG training dataset were used to obtain the 44% EM accuracy on NQ?