Training for sentence vectors in niche domain

Hi @BramVanroy,

Thanks for the reply. I'm a little embarrassed to admit that I didn't know about sent2vec. I'll give that a shot and report back on how it does.

LaBSE also looks interesting, but when I read through the paper I noticed this:

> We observe that LaBSE performs worse on pairwise English semantic similarity than other sentence embedding models. This result contrasts with its excellent performance on cross-lingual bi-text retrieval. The cross-lingual m-USE model notably achieves the best overall performance, even outperforming SentenceBERT when SentenceBERT is not fine-tuned for the STS task. We suspect training LaBSE on translation pairs biases the model to excel at detecting meaning equivalence, but not at distinguishing between fine grained degrees of meaning overlap.

Seems like it does far better for similarity across languages than within the same language. I appreciate you sharing it, though! I had heard of USE, but it looks like I have plenty more to research and try. I think this might just be a scenario where I have to try several different models on my data and see what happens. Thanks!
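For anyone else doing this kind of model shoot-out, here's a minimal sketch of the comparison harness I have in mind: score the same in-domain sentence pairs under each candidate model and compare. Everything here is a placeholder — the `bow_embed` bag-of-words embedder is just a toy stand-in for real models like sent2vec, LaBSE, or SentenceBERT, and the sentence pairs are made up:

```python
import math
from collections import Counter

def cosine(u, v):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bow_embed(sentence, vocab):
    # Toy bag-of-words embedder; in practice you'd swap in a real
    # sentence-embedding model's encode function here.
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

def compare_models(embedders, pairs):
    # Score every (sentence, sentence) pair under every candidate
    # model so results can be compared side by side.
    return {
        name: [cosine(embed(a), embed(b)) for a, b in pairs]
        for name, embed in embedders.items()
    }

# Hypothetical in-domain pairs: one paraphrase pair, one unrelated pair.
pairs = [
    ("the assay detected the protein",
     "the protein was detected by the assay"),
    ("the assay detected the protein",
     "shipping costs rose last quarter"),
]
vocab = sorted({w for a, b in pairs for w in (a + " " + b).lower().split()})
embedders = {"toy-bow": lambda s: bow_embed(s, vocab)}

scores = compare_models(embedders, pairs)
print(scores["toy-bow"][0] > scores["toy-bow"][1])  # → True
```

With real models you'd register one `encode` function per model in `embedders` and eyeball (or correlate against gold labels) which one separates paraphrases from unrelated sentences best on your domain.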