What is the best way to compute document similarity?

I was thinking of using SentenceTransformers to measure document similarity.

Is this the best way?

Also, is there a model I can use to apply contrastive learning to document similarity?


Yup, SentenceTransformers can definitely be used for measuring document similarity. Depending on the size of your documents, you might want to choose a model that was tuned for dot-product similarity (e.g., from the MSMARCO docs: “Models with normalized embeddings will prefer the retrieval of shorter passages, while models tuned for dot-product will prefer the retrieval of longer passages.”). You might also have to split large documents into chunks, since the models truncate any input longer than their maximum sequence length.
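Here is a rough sketch of the chunk-then-embed idea, in case it helps. The helper names (`chunk_words`, `embed_document`, `document_similarity`), the word-based window sizes, and the choice to average chunk embeddings are my own assumptions for illustration, not something the library prescribes. Word counts are only a crude stand-in for the model's real token limit, and the inputs are assumed to be non-empty.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows (a rough proxy for token limits)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def embed_document(model, text: str):
    """Embed every chunk, then average the chunk vectors into one document vector."""
    chunks = chunk_words(text)
    if not chunks:
        raise ValueError("cannot embed an empty document")
    embeddings = model.encode(chunks, convert_to_tensor=True)  # shape: (n_chunks, dim)
    return embeddings.mean(dim=0)


def document_similarity(
    doc_a: str,
    doc_b: str,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> float:
    """Cosine similarity between two documents' averaged chunk embeddings."""
    # Imported here so the chunking helper can be used without the library installed.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    emb_a = embed_document(model, doc_a)
    emb_b = embed_document(model, doc_b)
    return util.cos_sim(emb_a, emb_b).item()
```

Calling `document_similarity(text_a, text_b)` returns a single float. If you pick a model tuned for dot-product instead, `util.dot_score` would be the matching scoring function. Averaging is just one simple option; comparing chunks pairwise and taking the maximum score is another.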

For contrastive learning, I think you could use sentence-transformers/all-MiniLM-L6-v2 with ContrastiveLoss.
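As a minimal sketch of what that could look like: the code below assumes you have documents grouped by some label (topic, author, etc.) and treats same-group pairs as similar (label 1) and cross-group pairs as dissimilar (label 0). The `build_pairs` and `fine_tune` helpers and the hyperparameters are hypothetical and only for illustration. It uses the classic `fit` training API; newer releases of the library also offer a trainer-based API.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]


def build_pairs(docs_by_group: Dict[str, List[str]]) -> List[Pair]:
    """Label same-group pairs 1 (similar) and cross-group pairs 0 (dissimilar)."""
    labeled = [(group, doc) for group, docs in docs_by_group.items() for doc in docs]
    pairs = []
    for (group_a, doc_a), (group_b, doc_b) in combinations(labeled, 2):
        pairs.append((doc_a, doc_b, 1 if group_a == group_b else 0))
    return pairs


def fine_tune(
    pairs: List[Pair],
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    epochs: int = 1,
    batch_size: int = 16,
):
    """Fine-tune a SentenceTransformer on labeled pairs with ContrastiveLoss."""
    # Imported here so the pair-building helper works without the library installed.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[a, b], label=label) for a, b, label in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # ContrastiveLoss pulls label-1 pairs together and pushes label-0 pairs apart.
    loss = losses.ContrastiveLoss(model=model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=10)
    return model
```

Note that enumerating all pairs grows quadratically with the number of documents, so for a real dataset you would probably want to sample pairs rather than build every combination.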

We’re actually looking at ways to improve the user experience with SentenceTransformers + Hugging Face, so feel free to post here or message me directly if you have any questions or feedback 🙂