Training for sentence vectors in niche domain

Hi @BramVanroy,

Thanks for the reply. I'm a little embarrassed to admit that I didn't know about sent2vec. I'll give that a shot and report back on how it does.

LaBSE also looks interesting, but when I read through the paper I noticed this:

> We observe that LaBSE performs worse on pairwise English semantic similarity than other sentence embedding models. This result contrasts with its excellent performance on cross-lingual bi-text retrieval. The cross-lingual m-USE model notably achieves the best overall performance, even outperforming SentenceBERT when SentenceBERT is not fine-tuned for the STS task. We suspect training LaBSE on translation pairs biases the model to excel at detecting meaning equivalence, but not at distinguishing between fine grained degrees of meaning overlap.

Seems like it does far better for similarity across languages than within the same language. I appreciate you sharing it, though! I had heard of USE, but it looks like I have plenty more to research and try. I think this might just be a scenario where I have to try several different models on my data and see what happens. Thanks!
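For anyone else doing this kind of model shoot-out, here's a minimal sketch of the comparison harness I have in mind: score the same in-domain sentence pairs under each candidate model and compare. Everything here is a placeholder — the `bow_embed` bag-of-words embedder is just a toy stand-in for real models like sent2vec, LaBSE, or SentenceBERT, and the sentence pairs are made up:

```python
import math
from collections import Counter

def cosine(u, v):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bow_embed(sentence, vocab):
    # Toy bag-of-words embedder; in practice you'd swap in a real
    # sentence-embedding model's encode function here.
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

def compare_models(embedders, pairs):
    # Score every (sentence, sentence) pair under every candidate
    # model so results can be compared side by side.
    return {
        name: [cosine(embed(a), embed(b)) for a, b in pairs]
        for name, embed in embedders.items()
    }

# Hypothetical in-domain pairs: one paraphrase pair, one unrelated pair.
pairs = [
    ("the assay detected the protein",
     "the protein was detected by the assay"),
    ("the assay detected the protein",
     "shipping costs rose last quarter"),
]
vocab = sorted({w for a, b in pairs for w in (a + " " + b).lower().split()})
embedders = {"toy-bow": lambda s: bow_embed(s, vocab)}

scores = compare_models(embedders, pairs)
print(scores["toy-bow"][0] > scores["toy-bow"][1])  # → True
```

With real models you'd register one `encode` function per model in `embedders` and eyeball (or correlate against gold labels) which one separates paraphrases from unrelated sentences best on your domain.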