RAG: Do we need to pretrain the doc-encoder when using a custom dataset?

The Hugging Face RAG implementation now includes a script that lets us use a custom dataset instead of the wiki dataset.

Since we do not update the doc-encoder during RAG fine-tuning (we update only BART and the question encoder), what happens if our custom dataset has a different distribution from the wiki dataset (e.g., medical records)?

Will it still work?

P.S. In the RAG paper, the authors used the pretrained DPR and never updated the doc-encoder weights during fine-tuning.
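
To make the setup concrete, here is a toy sketch of how I understand it: the document embeddings are computed once by the frozen doc-encoder and stored in an index, and fine-tuning can only move the question embedding relative to that fixed index. All vectors and document ids below are made up for illustration; this is plain Python, not the actual Hugging Face or DPR code.

```python
# Toy sketch (hypothetical values): the document index is built once from a
# frozen doc-encoder, while fine-tuning only changes the question embedding.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_index):
    """Return the id of the document with the highest inner-product score."""
    return max(doc_index, key=lambda doc_id: dot(query_vec, doc_index[doc_id]))

# Hypothetical embeddings produced once by the frozen doc-encoder.
doc_index = {
    "doc_cardiology": [0.9, 0.1],
    "doc_oncology": [0.1, 0.9],
}
# Copy of the index, used only to check that retrieval never modifies it.
frozen_snapshot = {doc_id: list(vec) for doc_id, vec in doc_index.items()}

# Hypothetical embeddings of the same question before and after the
# question encoder has been fine-tuned.
query_before = [0.8, 0.3]
query_after = [0.2, 0.7]

top_before = retrieve(query_before, doc_index)
top_after = retrieve(query_after, doc_index)
index_unchanged = doc_index == frozen_snapshot

print(top_before, top_after, index_unchanged)
```

In this picture the question encoder can learn to point at different regions of the index, but if the frozen doc-encoder places medical documents poorly in the first place, nothing in fine-tuning can move them. That is the situation I am asking about.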
