What is the best way to compute document similarity?

I was thinking of using SentenceTransformers to measure document similarity.
https://www.sbert.net/

Is this the best way?

Also, is there a model I could use to apply contrastive learning to document similarity?

Yup, SentenceTransformers can definitely be used for measuring document similarity. Depending on the size of your documents, you might want to choose a model that was tuned for dot-product similarity. (E.g. from the MSMARCO docs: “Models with normalized embeddings will prefer the retrieval of shorter passages, while models tuned for dot-product will prefer the retrieval of longer passages.”) You might also have to split long documents into chunks; otherwise anything beyond the model's maximum sequence length gets truncated.
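
To make the chunking idea concrete, here is a minimal sketch in plain Python. It is only one possible approach, not a recommended recipe: it splits each document into overlapping word windows, embeds every chunk, averages the chunk embeddings, and scores the two averages. The names `chunk_words`, `document_similarity`, and `toy_encode` are hypothetical helpers made up for this example, and the word-count limits are arbitrary assumptions (real models truncate by tokens, not words). The toy encoder just counts a few keywords so the snippet runs without downloading a model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    chunks: List[str] = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Average a non-empty list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_similarity(
    doc_a: str,
    doc_b: str,
    encode: Callable[[List[str]], Sequence[Vector]],
    max_words: int = 200,
    overlap: int = 20,
    score: Callable[[Vector, Vector], float] = cosine,
) -> float:
    """Embed each document chunk by chunk, mean-pool, then score the pair."""
    chunks_a = chunk_words(doc_a, max_words, overlap)
    chunks_b = chunk_words(doc_b, max_words, overlap)
    if not chunks_a or not chunks_b:
        return 0.0
    vec_a = mean_vector([list(v) for v in encode(chunks_a)])
    vec_b = mean_vector([list(v) for v in encode(chunks_b)])
    return score(vec_a, vec_b)


# Hypothetical stand-in for a real embedding model: counts a few keywords.
VOCAB = ("cat", "dog", "stock", "market")


def toy_encode(texts: List[str]) -> List[List[float]]:
    return [[float(t.lower().split().count(w)) for w in VOCAB] for t in texts]


if __name__ == "__main__":
    print(document_similarity("the cat chased the dog", "a dog and a cat", toy_encode))
    print(document_similarity("the cat chased the dog", "stock market news", toy_encode))
```

With the real library you would pass something like `SentenceTransformer("all-MiniLM-L6-v2").encode` as `encode`, and switch `score` to `dot` if the model you picked was tuned for dot-product. Mean pooling over chunks is just one choice; comparing chunks pairwise and taking the maximum is another option worth trying.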

For contrastive learning, I think you could use sentence-transformers/all-MiniLM-L6-v2 with ContrastiveLoss.
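
In case it helps to see what that objective does, here is a minimal sketch of the standard contrastive loss for a single pair, written in plain Python. It is an illustration of the formula, not the library's implementation. It assumes cosine distance and a margin of 0.5, which I believe match the `ContrastiveLoss` defaults in sentence-transformers, but please check the docs. The function names are made up for this example.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def contrastive_loss(
    emb_a: Sequence[float],
    emb_b: Sequence[float],
    label: int,
    margin: float = 0.5,  # assumed default, see the note above
) -> float:
    """Contrastive loss for one pair: label 1 = similar, label 0 = dissimilar."""
    d = cosine_distance(emb_a, emb_b)
    pull = label * d ** 2                            # similar pairs: shrink the distance
    push = (1 - label) * max(0.0, margin - d) ** 2   # dissimilar pairs: push past the margin
    return 0.5 * (pull + push)


if __name__ == "__main__":
    same = [1.0, 0.0]
    other = [0.0, 1.0]
    print(contrastive_loss(same, same, label=1))   # similar and identical
    print(contrastive_loss(same, same, label=0))   # dissimilar but identical
    print(contrastive_loss(same, other, label=0))  # dissimilar and far apart
```

So the training data is pairs of texts with a 1/0 label: similar pairs are penalized for being far apart, and dissimilar pairs are penalized only while they are closer than the margin. In sentence-transformers you would wrap the model with `losses.ContrastiveLoss(model)` and feed it labeled pairs like that.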

We’re actually looking at ways to improve the user experience with SentenceTransformers + Hugging Face, so feel free to post here or message me directly if you have any questions or feedback 🙂
