RAG custom dataset

I just saw that Facebook AI released a blog post about RAG (“Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models”) and that it is already incorporated in the HuggingFace API.

I took a quick look and couldn’t see how to use a custom dataset with it. It seems to only download pre-indexed datasets from HuggingFace’s AWS storage. Could anyone show me how to:

  1. Create an indexed dataset. I’m assuming this is just a large collection of embeddings produced by running documents through a model and taking its output embedding. Which model(s) can be used, how many dimensions should the embeddings have, and how should the vectors be formatted?
  2. Use that custom dataset with HF RAG models.

Indeed, it’s actually very simple to do with the datasets library, and it is partly explained on this page: https://huggingface.co/docs/datasets/faiss_and_ea.html

We will add an example script on this.