Hello everyone,
I am currently trying to build a RAG system using FAISS together with a domain-adapted tokenizer and language model, but the retrieval results are poor. Even when I skip fine-tuning and use the tokenizer and model as-is, the results are still unsatisfactory.
Here are the steps I followed to build the search index, which are largely based on the instructions provided in the HF course.
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

dataset = load_dataset("myCompany/help-desk-emails-20240124", token=True, split="train")  # all of my company's support emails (single split, assumed "train" here)
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert/distilbert-base-multilingual-cased")
embeddings_dataset = dataset.map(lambda x: {"embeddings": embed(x["text"], model, tokenizer).detach().cpu().numpy()[0]})
embeddings_dataset.add_faiss_index(column="embeddings")
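Not shown above for brevity: the model is assumed to already sit on `device` (the same one `embed` moves its inputs to) and to be in eval mode, roughly:

```python
import torch

# assumed setup, not part of the snippet above: pick a device and move the model onto it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # inference only
```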
where the embed function is defined as follows:
def embed(texts: list[str], model: AutoModel, tokenizer: AutoTokenizer, max_length: int = 300) -> torch.Tensor:
    # tokenize with padding/truncation to max_length and move the batch to the model's device
    encoded_input = tokenizer(texts, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input, output_hidden_states=True)
    # pool the output down to one vector per input text
    return cls_pooling(model_output)
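and cls_pooling is essentially the helper from the HF course, i.e. it takes the hidden state of the [CLS] token as the sentence embedding:

```python
def cls_pooling(model_output):
    # use the hidden state of the first token ([CLS]) as the embedding of the whole text
    return model_output.last_hidden_state[:, 0]
```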
Everything runs smoothly up to this point. However, when I try to query the index, the results are poor. Here’s an example:
question = "Issues with carton magazine"
question_embedding = embed([question], model, tokenizer).cpu().detach().numpy()[0]
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", question_embedding, k=5)
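The scores and texts below are printed with a plain loop over the returned values, along these lines:

```python
# print each retrieved example together with its FAISS score
for score, text in zip(scores, samples["text"]):
    print(f"SCORE: {score}")
    print(f"TEXT: {text}")
    print("=" * 50)
```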
The results I get are as follows:
SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs
==================================================
SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs
==================================================
SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs
==================================================
SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs
==================================================
SCORE: 28.32615089416504
TEXT: information is strictly prohibited . P Before printing think about environment and costs
==================================================
As you can see, all five hits are identical and come from a boilerplate email footer, which has nothing to do with the query. I am unsure what I am doing wrong. Any help or guidance would be greatly appreciated.
Environment:
- `transformers` version: 4.38.2
- Platform: Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
- Python version: 3.11.5
- Huggingface_hub version: 0.21.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- PyTorch version (GPU?): 2.2.1+cu121 (True)