How to find closest embedding vectors?

I have 100K known embedding i.e.

[emb_1, emb_2, ..., emb_100000]

My task is given an embedding(embedding_new) find the closest 10 embedding from the above 100k embedding.

The way I am approaching this problem is brute force.

Given a query to find the closest embeddings, I compare embedding_new with [emb_1, emb_2, ..., emb_100000] and get the similarity score.

Then I do quicksort of the similarity score to get the top 10 closest embedding.

Is there a better way to achieve this?

2 Likes

Hey there! I don’t think you can get much faster than this, but I may be wrong. Once you have the list of similarities, you can use torch.topk to obtain the top results of interest.

2 Likes

Yes, please have a look at Nearest Neighbour Search algorithms like Faiss, Hnsw, etc. You can also you libraries like annoy.

Annoy Library: GitHub - spotify/annoy: Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

The above mentioned algorithms use special data structures like graphs/trees to reduce the run time from linear.

Thanks,
Sharan Babu

2 Likes