Hi all,
Has anyone had success improving sentence embeddings for similarity search in a RAG setup?
I’ve experimented with the following approaches:
- Removing stopwords based on TF-IDF, plus lowercasing, lemmatising, etc. (rough sketch after this list)
- Filtering out low-mutual-information and low-entropy words
- Using various sentence embedding models (e.g. all-MiniLM, MPNet; minimal retrieval sketch below)
- Fine-tuning sentence embedding models via Siamese and triplet networks with Multiple Negatives Ranking (MNR) loss (training sketch below). Would this also work with other sampling strategies, such as hard negative mining? And any tips on improving the embeddings through different training setups?
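
For context, here's roughly the preprocessing step I mean. This is a minimal sketch; the corpus and the IDF threshold are made-up toy values:

```python
# Rough sketch of TF-IDF-based stopword filtering (toy corpus, arbitrary threshold).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast fox leapt over the sleepy hound.",
    "Retrieval quality depends on the embedding model.",
]

vectorizer = TfidfVectorizer(lowercase=True)
vectorizer.fit(corpus)

# Words with low IDF appear in most documents and carry little signal,
# so drop them before embedding. The 1.1 cut-off is arbitrary for this toy data.
vocab_idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
low_info = {w for w, idf in vocab_idf.items() if idf < 1.1}

def strip_low_info(text: str) -> str:
    return " ".join(t for t in text.lower().split() if t.strip(".,") not in low_info)

print(strip_low_info(corpus[0]))  # "the" gets dropped, content words survive
```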
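
The baseline retrieval setup looks like this, using the public sentence-transformers checkpoints (the query and docs are toy examples):

```python
# Minimal embed-and-rank sketch with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# swap in "sentence-transformers/all-mpnet-base-v2" for the MPNet variant

docs = [
    "How do I reset my password?",
    "Shipping usually takes 3-5 business days.",
    "You can change your password from the account settings page.",
]
query = "forgot my login credentials"

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity per doc
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```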
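
And the MNR fine-tuning is roughly the following. If you pass (anchor, positive, hard negative) triples, MultipleNegativesRankingLoss uses the third text as an extra negative on top of the in-batch negatives, which is the hard-negative variant I was asking about. The training examples here are invented placeholders:

```python
# Sketch of MNR fine-tuning with explicit hard negatives (sentence-transformers).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=[
        "How do I reset my password?",                        # anchor (query)
        "You can change your password in account settings.",  # positive
        "Passwords must be at least 8 characters long.",      # hard negative
    ]),
    # ... more (anchor, positive, hard negative) triples mined from your data
]

# MNR treats every other positive in the batch as a negative,
# so larger batch sizes generally help.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```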
Despite these efforts, I’ve seen minimal improvement in retrieval performance.
Would love to hear if anyone has a solid workflow or other suggestions that worked well for them!
I’m only looking at non-API-based embeddings, since I don’t want to rely on API calls in later stages of production.
Thanks in advance!