I have `100K` known embeddings, i.e.

```
[emb_1, emb_2, ..., emb_100000]
```

My task: given a new embedding (`embedding_new`), find the 10 closest embeddings from the above `100K` embeddings.

The way I am approaching this problem is brute force. Given a query, I compare `embedding_new` with `[emb_1, emb_2, ..., emb_100000]` and compute the similarity scores. Then I quicksort the similarity scores to get the top `10` closest embeddings.
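The brute-force approach above can be sketched with NumPy (a minimal sketch; the embedding dimension and random data are placeholders). Note that a full quicksort is not needed for the top 10: `np.argpartition` finds them in linear time, and only the 10 survivors need sorting.

```python
import numpy as np

# Placeholder data: 100K known embeddings of dimension 128
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((100_000, 128)).astype(np.float32)
embedding_new = rng.standard_normal(128).astype(np.float32)

# Cosine similarity of the query against all 100K rows in one matmul
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(embedding_new)
scores = embeddings @ embedding_new / norms

# argpartition: top-10 indices in O(n), instead of O(n log n) full sort
top10 = np.argpartition(scores, -10)[-10:]
top10 = top10[np.argsort(scores[top10])[::-1]]  # order those 10 by score
```

This is still a linear scan over all 100K embeddings per query, but it avoids sorting the full score list.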

Is there a better way to achieve this?


Hey there! I don’t think you can get much faster than this, but I may be wrong. Once you have the list of similarities, you can use `torch.topk` to obtain the top results of interest.
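In PyTorch that could look like the following (a minimal sketch with placeholder shapes; `torch.topk` returns the values and indices already sorted, so no explicit quicksort is needed):

```python
import torch

# Placeholder data: 100K known embeddings of dimension 128
torch.manual_seed(0)
embeddings = torch.randn(100_000, 128)
embedding_new = torch.randn(128)

# Cosine similarity of the query against every row (broadcast over dim 0)
scores = torch.nn.functional.cosine_similarity(
    embeddings, embedding_new.unsqueeze(0), dim=1
)

# Top-10 scores and their indices, sorted in descending order
values, indices = torch.topk(scores, k=10)
```

If the embeddings live on a GPU, the same code runs there unchanged, which can make even the brute-force scan quite fast.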


Yes, please have a look at nearest-neighbour search algorithms like Faiss, HNSW, etc. You can also use libraries like Annoy.

Annoy library: GitHub - spotify/annoy (Approximate Nearest Neighbors in C++/Python, optimized for memory usage and loading/saving to disk).

The above-mentioned algorithms use special data structures like graphs/trees to reduce the run time from linear to sub-linear.

Thanks,

Sharan Babu
