DPR retriever module

I saw https://github.com/huggingface/transformers/pull/5279, which describes the DPR flow.

Just checking to see when the retriever module will be available.
Many thanks for making DPR available!

I see this topic has already been answered on GitHub by Quentin.
So I'd like to add the answer here for convenience.

The retriever is now part of the datasets library (formerly named nlp).
You can install it with

pip install datasets

and load the retriever:

from datasets import load_dataset

wiki = load_dataset("wiki_dpr", with_embeddings=False, with_index=True, split="train")

The retriever is essentially a dense index over Wikipedia passages.
To query it with the DPR question encoder, you can do:

from transformers import DPRQuestionEncoderTokenizer, DPRQuestionEncoder

# Load the pretrained DPR question tokenizer and encoder
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
question_encoder = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
question = "What is love ?"

# Encode the question; [0] selects the pooled embedding, converted to a NumPy array of shape (1, hidden_size)
question_emb = question_encoder(**question_tokenizer(question, return_tensors="pt"))[0].detach().numpy()

# Retrieve the k nearest passages from the index, together with their scores
passages_scores, passages = wiki.get_nearest_examples("embeddings", question_emb, k=20)
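
For intuition about what the dense index is doing, here is a minimal sketch using plain NumPy rather than the real index. DPR scores a passage by the inner product between the question embedding and the passage embedding, so a brute-force version simply computes all inner products and keeps the top k. The function name `nearest_passages` and the toy 3-dimensional vectors are made up for illustration; the actual wiki_dpr index is far larger and is searched by the library for you.

```python
import numpy as np

def nearest_passages(question_emb, passage_embs, k=2):
    """Return (scores, indices) of the k passages with the highest inner product.

    Hypothetical helper for illustration only; not part of datasets or transformers.
    """
    # Flatten a (1, dim) query, like question_emb above, into a 1-D vector
    query = np.asarray(question_emb, dtype=np.float32).reshape(-1)
    # One inner product per passage
    scores = np.asarray(passage_embs, dtype=np.float32) @ query
    # Indices of the k highest scores, best first
    top = np.argsort(-scores)[:k]
    return scores[top], top

# Toy embeddings standing in for real DPR passage vectors (assumed values)
toy_passages = ["passage about love", "passage about football", "passage about romance"]
toy_embs = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.2, 0.9],
                     [0.8, 0.3, 0.1]], dtype=np.float32)
toy_question = np.array([[1.0, 0.0, 0.0]], dtype=np.float32)

scores, idx = nearest_passages(toy_question, toy_embs, k=2)
for s, i in zip(scores, idx):
    print(f"{s:.2f}  {toy_passages[i]}")
```

In the real call above, `passages` should come back as a dictionary of columns, so the retrieved titles and texts would likely be read from `passages["title"]` and `passages["text"]`, assuming the wiki_dpr column names.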