Poor Results with FAISS Index on RAG System

@lewtun

Hello everyone,

I am trying to build a RAG system using FAISS together with a domain-adapted tokenizer and language model. However, the retrieval results are unsatisfactory, even when I skip fine-tuning and use the stock tokenizer and model.

Here are the steps I followed to build the search index, which are largely based on the instructions provided in the HF course.

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = load_dataset("myCompany/help-desk-emails-20240124", token=True)  # all the emails from my company's support desk
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert/distilbert-base-multilingual-cased").to(device)
embeddings_dataset = dataset.map(lambda x: {"embeddings": embed(x["text"], model, tokenizer).detach().cpu().numpy()[0]})
embeddings_dataset.add_faiss_index(column="embeddings")

where the embed function is

def embed(texts: list[str], model: AutoModel, tokenizer: AutoTokenizer, max_length: int = 300) -> torch.Tensor:
    """Return one CLS-pooled embedding per input text."""
    # Pad or truncate every text to max_length tokens and return PyTorch tensors
    encoded_input = tokenizer(texts, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
    # Move the input tensors to the target device
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input, output_hidden_states=True)
    return cls_pooling(model_output)
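
The `cls_pooling` helper is not shown in this post. A minimal sketch, assuming it matches the helper of the same name from the HF course, which takes the hidden state of the first (`[CLS]`) token as the sentence embedding. The `SimpleNamespace` object is only a hypothetical stand-in so the snippet runs without loading the model:

```python
from types import SimpleNamespace

import torch


def cls_pooling(model_output) -> torch.Tensor:
    # Assumed to mirror the HF course helper: keep only the hidden state of
    # the first token ([CLS]) of each sequence -> shape (batch, hidden_size).
    return model_output.last_hidden_state[:, 0]


# Hypothetical stand-in for a model output: batch of 2, 3 tokens, hidden size 4.
fake_output = SimpleNamespace(last_hidden_state=torch.arange(24.0).reshape(2, 3, 4))
pooled = cls_pooling(fake_output)
print(pooled.shape)
```

With this kind of pooling, each text is represented by a single vector taken from the last layer, so the `[0]` applied after `.numpy()` selects the embedding of the first (and only) text in the batch.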

Everything runs smoothly up to this point. However, when I try to query the index, the results are poor. Here’s an example:

question = "Issues with carton magazine"
question_embedding = embed([question], model, tokenizer).cpu().detach().numpy()[0]
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", question_embedding, k=5)
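
A loop along these lines would print hits in the SCORE/TEXT layout used in this post. It is only a sketch: it assumes `scores` is a sequence of floats and `samples` is a dict mapping column names to lists, which is the shape `get_nearest_examples` returns. `format_hits` and the demo values are hypothetical and exist only so the snippet runs on its own:

```python
def format_hits(scores, samples, text_column="text"):
    """Render each hit as SCORE/TEXT lines followed by a separator line."""
    separator = "=" * 50
    blocks = []
    for score, text in zip(scores, samples[text_column]):
        blocks.append(f"SCORE: {score}\nTEXT: {text}\n\n{separator}\n")
    return "\n".join(blocks)


# Hypothetical stand-in values, not real query results.
demo_scores = [1.5, 2.0]
demo_samples = {"text": ["first email", "second email"]}
report = format_hits(demo_scores, demo_samples)
print(report)
```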

The results I get are as follows:


SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs

==================================================

SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs

==================================================

SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs

==================================================

SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs

==================================================

SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs

==================================================

As you can see, the results are not relevant to the query. I am unsure about what I might be doing wrong. Any help or guidance would be greatly appreciated.
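
Because all five hits have exactly the same score and text, one thing that may be worth checking is whether the same text occurs many times in the dataset (quoted email footers, for instance), in which case identical embeddings could fill the whole top-k. A minimal sketch using only the standard library; `most_repeated` and the sample texts are hypothetical and stand in for the dataset's `text` column:

```python
from collections import Counter


def most_repeated(texts, top_n=3):
    """Return the top_n most frequent texts with their counts."""
    # Collapse runs of whitespace so near-identical copies are counted together.
    normalized = (" ".join(t.split()) for t in texts)
    return Counter(normalized).most_common(top_n)


# Hypothetical stand-in for the dataset's "text" column.
sample_texts = [
    "Before printing think about environment and costs",
    "Before  printing think about environment and costs",
    "The carton magazine is jammed",
]
top = most_repeated(sample_texts)
print(top)
```

If one footer-like text dominates the counts, dropping duplicates before indexing might be a reasonable next experiment, though this is only a guess at the cause.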

Env data:

- `transformers` version: 4.38.2
- Platform: Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
- Python version: 3.11.5
- Huggingface_hub version: 0.21.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- PyTorch version (GPU?): 2.2.1+cu121 (True)