RAG: Embedding models have converged

As someone new to the RAG world, I wanted to know which embedding model is actually the best. There are plenty of leaderboards out there, but none of them guarantee the same results on your own dataset. So I tested it myself.

Here's what I did:

  • took 8 datasets (2 private, 2 multilingual, 4 public)

  • ran 13 popular embedding models

  • logged latency and accuracy

  • calculated an Elo score by letting an LLM judge which model retrieved the better top-5 list (there's a sketch of this right after the list)
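For anyone curious how the Elo part works, here is a minimal sketch of the tournament loop, assuming a standard pairwise Elo update. Everything in it (the `judge` stub, the model names, the K-factor) is an illustrative placeholder, not my actual harness; in the real run, `judge` prompts an LLM with both top-5 lists and asks which one answers the query better.

```python
import itertools
import random

K = 32  # Elo K-factor; a common default, not necessarily what I used


def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """One Elo update; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    delta = K * (score_a - expected(r_a, r_b))
    return r_a + delta, r_b - delta


def judge(query: str, top5_a: list[str], top5_b: list[str]) -> float:
    """Stand-in for the LLM judge: the real version prompts an LLM with both
    top-5 lists and asks which better answers the query. Random so this runs."""
    return random.choice([1.0, 0.5, 0.0])


ratings = {f"model_{i}": 1000.0 for i in range(13)}  # all models start equal
queries = ["placeholder query 1", "placeholder query 2"]

for q in queries:
    for a, b in itertools.combinations(ratings, 2):
        # In the real run, each model retrieves its own top-5 for the query.
        top5_a = [f"{a} hit {k}" for k in range(5)]
        top5_b = [f"{b} hit {k}" for k in range(5)]
        score = judge(q, top5_a, top5_b)
        ratings[a], ratings[b] = update(ratings[a], ratings[b], score)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```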

What I expected was a clear separation. But what I got was the opposite.

  • ~85% of the models fall within the same narrow 50-point Elo range
  • The top 4 models are only ~23.5 Elo points apart
  • Rank 1 → rank 10 is roughly a 3% difference

The gaps are so small that, in practice, many of these models behave almost the same.
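To make "almost the same" concrete: the standard Elo formula converts a rating gap into an expected head-to-head win rate. Plugging in the gaps above (this is just the formula, not anything from my logs):

```python
# Expected head-to-head win rate implied by an Elo gap (standard Elo formula).
def win_prob(gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-gap / 400))

print(f"{win_prob(23.5):.1%}")  # ~53.4%: the gap across the top 4 models
print(f"{win_prob(50.0):.1%}")  # ~57.1%: even the full 50-point band
```

A 23.5-point gap means the top model wins only about 53% of head-to-head comparisons against rank 4, barely better than a coin flip.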

When I looked into why, it made sense: they’re all trained to solve the same narrow task, on similar data, with similar objectives. Naturally, they end up in the same performance range.

So my takeaway from this experiment is that choosing the “perfect” embedding model isn’t a big decision anymore. Maybe the real difference comes from the other parts of the pipeline.

If you want to dive deeper into the actual numbers, here is the full breakdown.
