What is the best way to compute document similarity?
I was thinking of using SentenceTransformers to measure document similarity.
https://www.sbert.net/
Is this the best way?
Also, is there a model I could use to apply contrastive learning to document similarity?
Yup, SentenceTransformers can definitely be used for measuring document similarity. Depending on the size of your documents, you might want to choose a model that was tuned for dot-product similarity. (E.g. from the MSMARCO docs: “Models with normalized embeddings will prefer the retrieval of shorter passages, while models tuned for dot-product will prefer the retrieval of longer passages.”) You might also have to split large passages into chunks, otherwise the model truncates anything beyond its maximum sequence length.
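Here is a rough sketch of the chunking idea, in case it helps. The helper names `chunk_words` and `document_similarity`, the 128-word window, and the choice to average chunk embeddings are all my own assumptions for illustration, not something the library prescribes. Splitting on whitespace only approximates the model's token limit, so check your model's actual limit before relying on it.

```python
from typing import List


def chunk_words(text: str, max_words: int = 128, overlap: int = 16) -> List[str]:
    """Split text into overlapping windows of words (hypothetical helper)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window reaches the end of the document.
        if start + max_words >= len(words):
            break
    return chunks


def document_similarity(
    doc_a: str,
    doc_b: str,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> float:
    """Cosine similarity between the mean chunk embeddings of two documents."""
    # Imported lazily so chunk_words can be used without the library installed.
    from sentence_transformers import SentenceTransformer, util

    chunks_a, chunks_b = chunk_words(doc_a), chunk_words(doc_b)
    if not chunks_a or not chunks_b:
        raise ValueError("both documents must contain some text")

    model = SentenceTransformer(model_name)
    # Assumption: one vector per document, obtained by averaging its chunks.
    emb_a = model.encode(chunks_a, convert_to_tensor=True).mean(dim=0)
    emb_b = model.encode(chunks_b, convert_to_tensor=True).mean(dim=0)
    return util.cos_sim(emb_a, emb_b).item()
```

Averaging is the simplest way to pool chunks. Comparing chunks pairwise and taking the best match is another option if your documents cover several topics.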
For contrastive learning, I think you could use sentence-transformers/all-MiniLM-L6-v2 with ContrastiveLoss.
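Something along these lines might work as a starting point. It is only a sketch: `make_pairs` and `fine_tune` are hypothetical helpers, the grouping of documents by topic is an assumed way of getting labels, and the hyperparameters are placeholders. It uses the classic `model.fit` training loop; newer releases of the library also offer a trainer-based API. For ContrastiveLoss, label 1 marks a similar pair and 0 a dissimilar one.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def make_pairs(groups: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Build labeled pairs from documents grouped by topic (hypothetical helper).

    Documents in the same group become positive pairs (label 1);
    documents from different groups become negative pairs (label 0).
    """
    pairs = []
    for texts in groups.values():
        for a, b in combinations(texts, 2):
            pairs.append((a, b, 1))
    for (_, texts_a), (_, texts_b) in combinations(groups.items(), 2):
        for a in texts_a:
            for b in texts_b:
                pairs.append((a, b, 0))
    return pairs


def fine_tune(
    pairs: List[Tuple[str, str, int]],
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    epochs: int = 1,
    batch_size: int = 16,
):
    """Fine-tune a SentenceTransformer model with ContrastiveLoss."""
    # Imported lazily so make_pairs can be used without these packages.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[a, b], label=label) for a, b, label in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.ContrastiveLoss(model=model)
    # Placeholder settings; tune epochs and warmup for your data size.
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=10)
    return model
```

Long documents would still need to be chunked or truncated before being paired, since each text in a pair is subject to the model's sequence limit.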
We’re actually looking at ways to improve the user experience with SentenceTransformers + Hugging Face, so feel free to post here or message me directly if you have any questions or feedback.