As someone new to the RAG world, I wanted to know which embedding model is actually the best. There are plenty of leaderboards out there, but none of them guarantee the same results on your own dataset. So I tested it myself.
I took:
- 8 datasets (2 private, 2 multilingual, 4 public)
- 13 popular embedding models
- logged latency and accuracy
- and calculated an Elo score by letting an LLM judge which model retrieved the better top-5 list (a minimal sketch of the update math follows this list)
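To make the Elo part concrete, here is roughly what the bookkeeping looks like. This is a sketch under assumptions: the `judge()` function, the K-factor of 32, the starting rating of 1000, and the `top5` / `queries` / `models` names are placeholders for illustration, not my exact setup.

```python
# Minimal Elo bookkeeping over pairwise LLM judgments (illustrative values).
from collections import defaultdict
from itertools import combinations

K = 32           # update step size (assumed)
START = 1000.0   # every model starts at the same rating (assumed)

ratings = defaultdict(lambda: START)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(model_a: str, model_b: str, outcome: str) -> None:
    """Apply one pairwise judgment: 'A', 'B', or 'tie'."""
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[outcome]
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# For each query, every pair of models gets judged on their top-5 lists
# (judge, top5, models, and queries are hypothetical names here):
# for query in queries:
#     for model_a, model_b in combinations(models, 2):
#         outcome = judge(query, top5[model_a][query], top5[model_b][query])
#         update(model_a, model_b, outcome)
```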
What I expected was a clear separation. But what I got was the opposite.
- ~85% of models fall in the same narrow 50-Elo range
- The top 4 models are only ~23.5 Elo points apart
- Rank 1 → rank 10 is roughly a 3% difference
The gaps are so small that, in practice, many of these models behave almost the same.
When I looked into why, it made sense: they’re all trained to solve the same narrow task, on similar data, with similar objectives. Naturally, they end up in the same performance range.
So my takeaway from this experiment is that choosing the “perfect” embedding model isn’t a big decision anymore. Maybe the real difference comes from the other parts of the pipeline.
If you want to dive deeper into actual numbers, here is the full breakdown.